A proposal to enhance RISC-V HPM (Hardware Performance Monitor)


Andy Glew Si5
 

I have NOT been working on a RISC-V performance monitoring proposal, but I've been involved with performance monitoring, first as a user and then as an architect, for many years and at several companies.

I would like to draw this group's attention to some features of Intel x86 performance monitoring  that turned out pretty successful.

First, you're already talking about hardware performance monitoring interrupts for statistical profiling. Good. A few comments on that below.

But I think one of the best bang-for-buck performance monitoring features of x86 EMON performance monitoring is the performance counter event filtering.

---+ Performance event filtering and transformation logic before counting

As I said, I think one of the best bang-for-buck features of x86 EMON performance monitoring is the performance counter event filtering. RISC-V has only the most primitive version of this.

Every x86 performance counter has per counter event select logic.

In addition to that logic, there is a mask that specifies which modes to count in: User, OS, hypervisor. I see that some of the RISC-V proposals also have that. Good.

But more filtering is also provided:  

Each counter has a "Counter Mask" (CMASK) - really, a threshold. When non-zero, it is compared against the count of the selected event in any given cycle: if the count is >= CMASK, the counter is incremented by 1; if less, there is no increment.

=> This comparison allows a HW event to be used to profile things like "number of cycles in which 1, 2 or more, 3 or more ... events happened" - e.g. how often you are able to achieve superscalar execution. In vectors, it might count how many vector elements are masked in or out. If you have events that correspond to buffer occupancy, you can profile to see where the buffer is full or not.

INV - a bit that allows the CMASK comparison to be inverted.

=> so that you can count event > threshold, and event < threshold.
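A rough software model of the CMASK/INV behavior described above (a sketch, not any real counter's encoding; `count_with_cmask` and the dispatch trace are made up for illustration):

```python
# Model of per-cycle CMASK thresholding with the INV bit.
# Each cycle the event logic reports how many events fired; the counter
# increments by 1 when the comparison passes.

def count_with_cmask(per_cycle_events, cmask, inv=False):
    """Count cycles where events >= cmask (or < cmask when inv is set).
    cmask == 0 means no filtering: accumulate the raw event counts."""
    total = 0
    for n in per_cycle_events:
        if cmask == 0:
            total += n                          # unfiltered event counting
        elif (n >= cmask) != inv:
            total += 1                          # count qualifying cycles
    return total

# Cycles in which a 4-wide machine dispatched 0..4 uops:
dispatches = [0, 1, 3, 4, 2, 0, 4]
cycles_2_or_more = count_with_cmask(dispatches, cmask=2)            # == 4
cycles_below_2   = count_with_cmask(dispatches, cmask=2, inv=True)  # == 3
```

Comparing the two filtered counts against total cycles shows how often superscalar dispatch is actually being achieved.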

I would really have liked to have >, <, and == threshold comparisons. And also the ability to increment by 1 when exceeding the threshold, or by the actual count that exceeds the threshold. The former lets you find where you are getting good superscalar behavior or not; the latter lets you determine the average when exceeding the threshold or not. But when I defined this, I had to save hardware.

This masked comparison allows you to get more different types of events, for events that can occur more than once per cycle. That's pretty good, but it doesn't help you with scarce events - events that only occur once every four or eight or N cycles. Later, Intel added what I call "push-out" profiling: when the comparison condition is met, e.g. when no instruction retires, a counter that increments once every clock cycle starts; when the condition changes, the value of that counter is what is recorded, naturally subject to all of the usual filtering. That was too much hardware for me to add in 1991, but it proved very useful.
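The push-out idea can be modeled like this (a sketch, assuming the condition is "no instruction retired this cycle"; the function name is illustrative):

```python
# Sketch of "push-out" profiling: while the filter condition holds
# (e.g. zero instructions retired this cycle), a cycle counter runs;
# when the condition clears, the accumulated value is recorded as one
# sample -- the length of that stall.

def push_out_samples(condition_per_cycle):
    samples, run = [], 0
    for cond in condition_per_cycle:
        if cond:
            run += 1                 # condition holds: keep counting cycles
        elif run:
            samples.append(run)      # condition cleared: record stall length
            run = 0
    if run:
        samples.append(run)          # stall still open at end of trace
    return samples

# retired==0 marked True; two stalls, of length 3 and 2:
stalls = push_out_samples([False, True, True, True, False, True, True, False])
# stalls == [3, 2]
```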

My first instinct is always to minimize hardware cost for performance monitoring hardware. The nice thing about the filtering logic at the performance counters is that it removes the hardware cost from the individual units, like the cache,
and leaves it in the performance monitoring unit. (The question of centralized versus decentralized performance counters is always an issue. Suffice it to say that Intel P6 went for centralized performance counters, to save hardware;
Pentium 4 went to a fully distributed performance counter architecture, but users hated it, so Intel returned to the centralized model, at least architecturally, although microarchitecture implementations might be decentralized.)

More filtering logic: each counter has an E (edge select) bit. This counts when the condition described by the privilege level mask and CMASK comparison changes. Using such edge filtering, you can determine the average length of bursts, e.g. the average length of a period in which you have not been able to execute any instruction, and so on. Simple filters can give you average lengths and occupancies; fancier and more expensive stuff is necessary to actually determine a distribution.
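A sketch of how edge counting yields average burst lengths, using one counter for cycles-in-condition and one for rising edges (the function and trace are made up for illustration):

```python
# With one counter counting cycles the condition holds and another counting
# edges (condition transitions), average burst length falls out as a ratio --
# no extra hardware beyond the E bit.

def burst_stats(condition_per_cycle):
    cycles_held = sum(condition_per_cycle)
    rising_edges = sum(
        1 for prev, cur in zip([False] + condition_per_cycle, condition_per_cycle)
        if cur and not prev)                    # False -> True transitions
    avg = cycles_held / rising_edges if rising_edges else 0.0
    return cycles_held, rising_edges, avg

trace = [False, True, True, True, False, True, False, True, True, False]
held, bursts, avg_len = burst_stats(trace)      # 6 cycles, 3 bursts, avg 2.0
```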

Certain events are themselves sent to the performance counters as bitmasks. E.g. the utilization of the execution unit ports as a bitmask - on the original P6 { ALU0/MUL/DIV/FMUL/FADD, ALU1/LEA, LD, STA, STD }, fancier on modern machines. By controlling the UMASK field of the filter control logic for each performance counter, you could specify counting all instruction dispatches, or just loads, and so on. Changing the UMASK field allowed you to profile to find out which parts of the machine were being used and which not. (This proved successful enough to get me, and the software guy who eventually started using it, an achievement award.)

If I were to do it over again I would have a generic POPCNT as part of the filter logic, as well as the comparison.
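A sketch of UMASK filtering over an event delivered as a port-utilization bitmask, with the generic POPCNT option wished for above (the port names follow the P6 list; the Python bit encoding is invented):

```python
# Each cycle the unit reports a bitmask of dispatch ports used. UMASK selects
# which bits qualify; counting "any qualifying bit set" vs. the popcount of
# qualifying bits distinguishes cycles-active from total dispatches.

PORT_ALU0, PORT_ALU1, PORT_LD, PORT_STA, PORT_STD = (1 << i for i in range(5))

def count_umask(per_cycle_masks, umask, use_popcount=False):
    total = 0
    for m in per_cycle_masks:
        hit = m & umask
        if use_popcount:
            total += bin(hit).count("1")  # increment by number of selected events
        elif hit:
            total += 1                    # increment by 1 if any selected event fired
    return total

masks = [PORT_ALU0 | PORT_LD, PORT_LD, PORT_ALU1, 0]
load_cycles  = count_umask(masks, PORT_LD)                       # == 2
all_dispatch = count_umask(masks, 0b11111, use_popcount=True)    # == 4
```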

Finally, simple filter stuff:

INT - Performance interrupt enable

PC - pin control - this predated me: it toggled an external pin when the performance counter overflowed. The very first EMON event sampling took that external pin and wired it back to the NMI pin of the CPU. Obviously, it is better to have internal logic for performance monitoring interrupts. Nevertheless, there is still a need for externally visible performance event signaling, e.g. to freeze performance counters outside the CPU, in the fabric, or in I/O devices. Exactly what those are is obviously implementation dependent, but it's still good to have a standard way of controlling such implementation dependent features. I call this the "pin architecture", and IMHO maintaining such system compatibility was as much a factor in Intel's success as instruction set architecture.

---+ Performance counter freeze

There are always several performance counters. At least two per privilege level. At least a pair, so you can compute things like cache miss rates and other ratios. But they are not necessarily dedicated to any specific privilege level, because that would be wasteful: you can study things a hell of a lot more quickly if you can use all of the performance counters when other modes are not using them.

When you have several performance counters, you are often measuring things together. You therefore need the ability to freeze them all at the same time. This means that you need to have all of the enable bits for all of the counters, or at least a subset, in the same CSR. If you want, the enable bit can be in both the per-counter control registers and in a central CSR - i.e. there can be multiple views of the same bit in different CSRs.
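A minimal model of the aliased-enable-bit idea (the register layout here is hypothetical, not any RISC-V proposal):

```python
# Enable bits live both in per-counter control registers and, aliased, in one
# central CSR, so a single write freezes every counter at once.

class PMU:
    def __init__(self, n_counters):
        # one enable bit per counter, all packed into one central CSR
        self.global_enable = (1 << n_counters) - 1

    def counter_enabled(self, i):
        # per-counter view: just another view of the same bit
        return (self.global_enable >> i) & 1

    def freeze_all(self):
        self.global_enable = 0        # one CSR write stops everything together

pmu = PMU(4)
assert all(pmu.counter_enabled(i) for i in range(4))
pmu.freeze_all()
assert not any(pmu.counter_enabled(i) for i in range(4))
```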

Performance analysts would really like the ability to freeze multiple CPUs' performance counters at the same time. This is one motivation for that pin control signal.

---+ Precise performance monitoring inputs - You can only wish!

When you are doing performance counter event interrupt based sampling, it would be really nice if the interrupt occurred exactly at the instruction that had the event.

If you can do that, great. However, it takes extra hardware to do that. Also, some events do not in any way correspond to a retired instruction - think events that occur on speculative instructions that never graduate/retire. Again, you can create special registers that record, say, the program counter of such a speculative instruction, but that is again extra hardware.

IMHO there is zero chance that all implementations, particularly the lowest cost implementations, will make all performance events precise.

At the very least there should be a way of discovering whether an event is precise or not.   

Machine check architectures have the same issue.


---+ What should the priority of the hardware performance monitoring interrupts be?

One of the best things I did for Intel was punt on this issue: because I was also the architect in charge of the APIC interrupt controller, I provided a special LVT interrupt register just for the performance monitoring interrupt.

This allowed the performance monitoring interrupt to use all of the APIC features - all of those that made sense. For example, the performance monitoring interrupt that Linux uses is just a normal interrupt of some priority. But as I mentioned above, the very first usage of performance monitoring interrupts used NMI, and was therefore able to profile code that had interrupts blocked. The interrupt could be directed to SMM, the not very good baby virtual machine monitor, allowing performance monitoring to be done independent of the operating system. Very nice when you don't have source code for the operating system. And so on. I can't remember, but it is possible that the interrupt could be directed to processors other than the local processor. However, that would not subsume the externally visible pin control, because the hardware pin can be a lot less expensive and a lot more precise than signaling an inter-processor interrupt.

I used a similar approach for machine check interrupts, which could also be directed to the operating system, NMI, SMM, hypervisor,…

By the way: I think Greg Favor said that x86's performance monitoring interrupts are level sensitive. That is not strictly speaking true: whether they are level sensitive or not is programmed into the APIC local vector table. You can make them level sensitive or edge triggered.

Obviously, however, when there are multiple performance counters sharing the same interrupt, you need to know which counter overflowed. Hence the sticky overflow bits that Greg noticed in the manuals.


---+ Fancier stuff

The above is mostly about getting the most out of simple performance counters: providing filter logic so that you can get the most insight out of a limited number of events;
providing enables in a central place so that you can freeze multiple counters at the same time; allowing the performance counter interrupts to be directed not just to different privilege levels but to different interrupt priorities, including NMI, and possibly also to external hardware.

There's a lot more stuff that can be done to help performance monitoring. Unfortunately, I have always worked at a place where I had to reduce  the performance monitoring hardware cost as much as possible.   I am sure, however, that many of you are familiar with fancier performance monitoring features, such as

+ Histogram counting (allowing you to count distributions without making multiple runs)
    => the CMASK comparators allow a very simple form of this, assuming you have enough performance counters. Actual histogram counters can do this more cheaply.

+ Cycle attribution - defining performance events so that you can actually say things like "X% of cycles are spent waiting for memory".
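The CMASK-based "very simple form" of histogram counting mentioned above can be sketched as follows, assuming K counters are programmed to the same event with thresholds 1..K (the function is illustrative, not hardware):

```python
# K counters count the same event at CMASK thresholds 1..K; differencing
# adjacent counters recovers a per-cycle event-count histogram in one run.

def histogram_via_cmask(per_cycle_events, max_count):
    # counter i holds: cycles with at least i+1 events (CMASK = i+1)
    at_least = [sum(1 for n in per_cycle_events if n >= i + 1)
                for i in range(max_count)]
    # cycles with exactly i events = (cycles >= i) - (cycles >= i+1)
    exact = [at_least[i] - at_least[i + 1] for i in range(max_count - 1)]
    exact.append(at_least[max_count - 1])
    return at_least, exact

events = [0, 1, 2, 2, 3, 1]
at_least, exact = histogram_via_cmask(events, 3)
# at_least == [5, 3, 1]; exact (for 1, 2, >=3 events) == [2, 2, 1]
```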

IMHO the single most important "advanced" performance monitoring feature is what I call "longitudinal profiling": AMD Instruction Based Sampling (IBS), DEC ProfileMe, ARM SPE (Statistical Profiling Extension). The basic idea is to set a bit on some randomly selected instruction packet somewhere high up in the pipeline, e.g. at instruction fetch, and then let that bit flow down the pipeline, sampling things as it goes. E.g. you might sample the cache miss latency or address, or whether it produced a stall in interaction with a different marked instruction. This sort of profiling is quite expensive, e.g. requiring a bit in many places in the pipeline, as well as registers to record the sample data, but it provides a lot of insight: it can give you distributions and averages, and it can tell you what interactions between instructions are causing problems.
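A toy model of the marked-bit idea (field names and the sampling interval are illustrative; real IBS/ProfileMe/SPE hardware randomizes which instruction is tagged and records far more per-stage state):

```python
# Longitudinal ("marked instruction") profiling: tag an occasional instruction
# near fetch, let the tag flow down the pipeline, and record per-stage facts
# only for tagged instructions -- only they pay the recording cost.

def sample_marked(instr_stream, interval=1000):
    # A fixed interval keeps this sketch deterministic; hardware would
    # randomize the choice to avoid sampling bias.
    records = []
    for i, instr in enumerate(instr_stream):
        if i % interval == 0:
            records.append({
                "pc": instr["pc"],
                "miss_latency": instr.get("miss_latency", 0),
                "retired": instr.get("retired", True),
            })
    return records

stream = [{"pc": 4 * i} for i in range(10)]
stream[6]["miss_latency"] = 120               # one load missed the cache
samples = sample_marked(stream, interval=3)   # tags instructions 0, 3, 6, 9
```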

However, if RISC-V cannot yet afford to do longitudinal profiling, the performance counter filter logic that I described above is low hanging fruit, much cheaper.




From: Alankao <alankao@...>
Sent: Monday, July 20, 2020 5:43PM
To: Tech-Privileged <tech-privileged@...>
Subject: Re: [RISC-V] [tech-privileged] A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

 

On 7/20/2020 5:43 PM, alankao wrote:
It was just so poorly rendered in my mail client, so please forgive my spam.

Hi Brian,

> I have been working on a similar proposal myself, with overflow, interrupts, masking, and delegation. One of the key differences in my proposal is that it unifies
> each counter's configuration control into a per-counter register, by using mhpmevent* but with some fields reserved/assigned a meaning.  <elaborating>

Thanks for sharing your experience and the elaboration. The overloading-hpmevent idea looks like the one in the SBI PMU extension threads in the Unix Platform Spec TG by Greg. I have a bunch of questions. How did your proposal fare later? Was it discussed in public? Did you manage to implement your idea into a working HW/S-mode SW/U-mode SW solution? If so, we can compete with each other by benchmarking for real: the LoC of the perf patch (assuming you do it on Linux) and the system overhead of running a long perf sample.

> Another potential discussion point is, does overflow happen at 0x7fffffffffffffff -> 0x8000000000000000, or at 0xffffffffffffffff -> 0x0000000000000000? I have a
> bias towards the former so that even after overflow, the count is wholly contained in an XLEN-wide register treated as an unsigned number and accessible via
> a single read, which makes arithmetic convenient, but I know some people prefer to (or are used to?) have the overflow bit as a 33rd or 65th bit in a different
> register.

I have no bias here as long as the HPM interrupt can be triggered. But somehow it seems to me that you assume the HPM registers are XLEN wide, but actually they are not (yet?). The spec says they should be 64 bits wide, although obviously nobody implements nor remembers that.
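For concreteness, the two overflow conventions under discussion, modeled for a 64-bit counter (a sketch, not spec text):

```python
# Interrupting at the sign-bit flip (0x7fff.. -> 0x8000..) keeps the running
# count readable as one unsigned 64-bit value after overflow; interrupting at
# the wrap (0xffff.. -> 0) needs an extra overflow bit held elsewhere.

MASK64 = (1 << 64) - 1
SIGN = 1 << 63

def tick(counter, overflow_at_sign_flip):
    new = (counter + 1) & MASK64
    if overflow_at_sign_flip:
        fired = (counter & SIGN) == 0 and (new & SIGN) != 0
    else:
        fired = new == 0                     # wrapped past all-ones
    return new, fired

_, irq = tick(0x7FFFFFFFFFFFFFFF, True)      # fires at the sign flip
assert irq
new, irq2 = tick(0xFFFFFFFFFFFFFFFF, False)  # fires at the wrap to zero
assert irq2 and new == 0
```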

> Lastly, a feature I have enjoyed using in the past (on another ISA) is the concept of a 'marked' bit in the mstatus register. ... This is of course a bit intrusive in
> the architecture, as it requires adding a bit to mstatus, but the rest of the kernel just needs to save and restore this bit on context switches, without knowing its
> purpose.

Which architecture/OS are you referring to here? 

Through this discussion, we will understand which idea the community prefers: adding CSRs, overloading existing hpmevents, or some balanced compromise. I believe the ultimate goal of this thread should be determining what the RISC-V HPM should really be like.

Best,
Alan

I apologize for some of the language errors that occur far too frequently in my email. I use speech recognition much of the time, and far too often do not catch misrecognition errors. This can be quite embarrassing, amusing, and/or confusing. Typical errors are not spelling but homonyms, words that sound the same - e.g. "cash" instead of "cache".


Andy Glew Si5
 

BTW, I have absolutely no idea  what changes would be necessary to the RISC-V HPM CSRs.

I do note, however, that the x86 control bits pretty much lived in one CSR per counter. Later versions added a few central registers for things like having all of the enable bits in one place so you can get a global freeze.


From: Andy Glew Si5 Via Lists.riscv.org <andy.glew=sifive.com@...>
Sent: Tuesday, July 21, 2020 3:53PM
To: Alankao <alankao@...>, Tech-Privileged <tech-privileged@...>
Subject: Re: [RISC-V] [tech-privileged] A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

 

On 7/21/2020 3:53 PM, Andy Glew Si5 via lists.riscv.org wrote:

alankao
 

Hi Andy,

Thank you for the hints as an Intel PMU architect.  My question is about the mode selection part as below.

It is not difficult to implement a mechanism by which an event is only counted in certain privilege modes. Both Greg's and my approach can achieve this. But in practice, we found that profiling higher-privileged modes has some problems. Under a basic Unix-like RISC-V configuration, the kernel runs in S-mode and there is M-mode for platform-specific stuff.

Say we now want to sample M-mode software. The first implementation decision is which mode the HPM interrupt should go to. Everything is more controllable if the interrupt just goes to S-mode, but obviously there is no easy way for S-mode software, the kernel, to read general M-mode information like the mepc (Machine Exception Program Counter) register. The other route goes to M-mode, but since the RISC-V HPM interrupt had never been seriously/publicly discussed until this thread, the effort so far, including the current PMU SBI extension proposal, does not address this.

I am curious how x86 addresses this problem. How does it enable hypervisor-mode sampling without similar issues?


Andy Glew Si5
 

I am curious how x86 addresses this problem. How does it enable hypervisor-mode sampling without similar issues?
x86's hardware performance monitoring interrupt is delivered to whatever is specified by the local APIC's LVT (Local Vector Table) entry for performance monitoring. This gets it out of being a special case, and makes it just like any other interrupt. The hypervisor or virtual machine monitor has to be able to handle any interrupt appropriately. In a simple virtual machine architecture, such as the initial version of Intel VT that I worked on, all external interrupts go to the hypervisor, and then the hypervisor can decide if it wants to deliver them to a guest privilege level. Fancier virtual machine architectures, such as current Intel's, allow certain interrupts to be sent directly to the guest, without being caught by the hypervisor first.

There should not be any special handling for hardware performance monitoring interrupts. They should be just like any other interrupt or exception. There should be a uniform delegation architecture for all interrupts and traps.  Eliminate as many special cases as possible.

For any given interrupt or exception, sometimes you want it to go straight to the hypervisor, and sometimes straight to the guest.

I say "hypervisor" here, but it might just as well be M-mode. Or, to generalize: sometimes you want it to go to the most privileged software level, sometimes to the least, and sometimes to one of the privileged software levels in between. The interrupt architecture should support that.

--

There's a bit of funkiness with respect to precise performance monitoring exceptions, just as there is for machine check. If you go through a complicated interrupt vectoring mechanism, it may become difficult to be precise. In fact, that's one of the reasons why P6's original performance monitoring interrupt was imprecise (that, and the fact that it took several cycles to propagate from the unit where the event occurred to the performance counter logic, and not even a uniform number of cycles - there was a tree of wires with differing numbers of latches on different paths).

But that is okay-ish.  You can have an interlock that prevents more instructions from retiring after the instruction where the precise performance monitoring event occurred, taking care to avoid deadlock, e.g. taking care that a higher priority interrupt can preempt while that interlock is in flight. Or you can add mechanisms to provide appropriate sampling when interrupts are actually imprecise. Or you can add a new interrupt/exception delivery mechanism that basically does the first thing, but throws out some of the complexity of your legacy trap delivery mechanism. It's microarchitecture.

By the way, if your performance counter takes more than one cycle to propagate a carry and detect overflow, you need such an interlock anyway.  That is not a common problem, but at Intel circa 2000 we regularly imagined a "fireball" OOO core that ran at frequencies such that you could only propagate a carry across 16 bits at a time; if you are wire limited rather than logic limited, that is four cycles for a 64-bit counter.  In fact, I believe that Intel PEBS (Precise Event Based Sampling) does not actually sample when the counter overflows; instead, when the counter overflows, it sets a bit that says "the next time this event is recognized, generate the interrupt". Which, if you think about it, is actually imprecise if more than one event occurs in any given cycle.

However, propagating through the interrupt delivery logic probably takes more cycles than propagating a carry across 64 bits.
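The "arm on overflow, sample on the next event" scheme described above can be modeled in a few lines. This is a sketch of the idea only, not Intel's actual PEBS logic; the function name and the tiny counter width used for illustration are invented:

```python
def pebs_step(count, armed, events, width=48):
    """One cycle of an 'arm on overflow, sample on next event' counter.

    count:  current counter value.
    armed:  True once an overflow has occurred, until a sample is taken.
    events: number of events recognized this cycle.
    """
    sample = armed and events > 0          # armed counter samples on the next event
    count = (count + events) % (1 << width)
    if count < events:                     # wrapped past zero: overflow, arm the sampler
        armed = True
    if sample:
        armed = False                      # sample taken; disarm until next overflow
    return count, armed, sample
```

Note that if several events are recognized in the sampling cycle, the single sample covers the whole group - exactly the residual imprecision described above.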




I apologize for some of the language errors that occur far too frequently in my email. I use speech recognition much of the time, and far too often do not catch misrecognition errors. This can be quite embarrassing, amusing, and/or confusing. Typical errors are not spelling but homonyms, words that sound the same - e.g. "cash" instead of "cache".


Andy Glew Si5
 

I meant "that's one of the reasons why P6's original performance monitoring interrupt was imprecise", not "forcing precise".  Darn that speech misrecognition: sometimes it inverts the meaning of what I say.  :-(





Greg Favor
 

A general comment about all the fancy forms of event filtering that can be nice to have:

The most basic one of general applicability is mode-specific filtering.  Past that one could try to define some further general filtering capabilities that aren't too specialized, but one quickly gets into having interesting filtering features specific to an event or class of events.

We take the view (in our design) that the lower 16 bits of mhpmevent are used for event selection in the broad sense of the word.  For us, the low 8 bits select between one of 256 "types" of events and then the upper 8 bits provide pretty flexible event-specific filtering.  That has turned out to very nicely support a very large variety of events in a very manageable way hardware-wise - in contrast to having many hundreds (or more) individual events.  But that is just our own implementation of event selection.

My point is that someone else can do similar or not so similar things in their own design with whatever types of general or event-specific filtering features that they might desire.  Trying to standardize that stuff can be tricky to say the least.

For now, at least, I think we should just let people decide what events and event filtering they feel are valuable in their designs.  We should only try to standardize any filtering features that are broadly applicable and valuable to have.  For myself, that results in just proposing mode-specific event filtering.

Greg


On Tue, Jul 21, 2020 at 3:53 PM Andy Glew Si5 <andy.glew@...> wrote:
I have NOT been working on a RISC-V performance monitoring proposal, but I've been involved with performance monitoring, first as a user and then as an architect, for many years and at several companies.

I would like to draw this group's attention to some features of Intel x86 performance monitoring  that turned out pretty successful.

First, you're already talking about hardware performance monitoring interrupts for statistical profiling. Good. A few comments on that below.

But I think one of the best bang-for-buck features of x86 EMON performance monitoring is the performance counter event filtering.

---+ Performance event filtering and transformation logic before counting

RISC-V has only the most primitive version of this event filtering.

Every x86 performance counter has per counter event select logic.

In addition to that logic, there is a mask that specifies which modes to count in - User, OS, hypervisor.  I see that some of the RISC-V proposals also have that. Good.

But more filtering is also provided:  

Each counter has a "Counter Mask" (CMASK) - really, a threshold. When non-zero, it is compared against the number of occurrences of the selected event in any given cycle. If the count is >= CMASK, the counter is incremented by 1; if less, there is no increment.

=> This comparison allows a HW event to be used to profile things like "number of cycles in which 1, 2 or more, 3 or more ... events happened" - e.g. how often you are able to achieve superscalar execution.  With vectors, it might count how many vector elements are masked in or out.  If you have events that correspond to buffer occupancy, you can profile to see where the buffer is full or not.

INV - a bit that allows the CMASK comparison to be inverted.

=> so that you can count cycles with events >= threshold, and cycles with events < threshold.
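As a sketch, the per-cycle increment logic with CMASK and INV looks something like this. The field names follow Intel's PERFEVTSEL; the function itself is invented for illustration:

```python
def counter_increment(event_count, cmask, inv):
    """Increment contributed by one cycle, after CMASK/INV filtering.

    event_count: occurrences of the selected event this cycle.
    cmask:       threshold; 0 disables filtering and counts raw events.
    inv:         inverts the sense of the threshold comparison.
    """
    if cmask == 0:
        return event_count        # no filtering: accumulate the raw event count
    met = event_count >= cmask    # the threshold comparison described above
    if inv:
        met = not met             # INV: count cycles *below* the threshold
    return 1 if met else 0        # threshold mode adds at most 1 per cycle
```

For example, with cmask=3, inv=0 you count cycles with 3-wide (or better) superscalar retirement; with cmask=1, inv=1 you count cycles in which nothing retired at all.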

I would really have liked to have >, <, and == threshold comparisons, and also the ability to increment by 1 when exceeding the threshold, or by the actual count in excess of the threshold. The former allows you to find where you are and are not getting good superscalar behavior; the latter allows you to determine the average when exceeding the threshold. When I defined this I had to save hardware.

This masked comparison allows you to get more different types of events, for events that occur more than once per cycle. That's pretty good, but it doesn't help you with scarce events, i.e. events that only occur once every four or eight or N cycles.  Later, Intel added what I call "push-out" profiling: when the comparison condition is met, e.g. when no instruction retires, a counter that increments by one every clock cycle starts; when the condition changes, the value of that counter is what is recorded, naturally subject to all of the usual filtering.  That was too much hardware for me to add in 1991, but it proved very useful.

My first instinct is always to minimize the hardware cost of performance monitoring hardware.  The nice thing about filtering logic at the performance counters is that it removes the hardware cost from the individual units, like the cache, and leaves it in the performance monitoring unit.  (The question of centralized versus decentralized performance counters is always an issue.  Suffice it to say that Intel P6 opted for centralized performance counters, to save hardware; Pentium 4 went to a fully distributed performance counter architecture, but users hated it, so Intel ultimately returned to the centralized model, at least architecturally, although microarchitectural implementations might be decentralized.)

More filtering logic: each counter has an E (edge select) bit.  This counts when the condition described by the privilege level mask and CMASK comparison changes.  Using such edge filtering, you can determine the average length of bursts, e.g. the average length of a period during which you were not able to execute any instructions, and so on.  Simple filters can give you average lengths and occupancies; fancier and more expensive hardware is necessary to actually determine a distribution.
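A small model of how the E bit interacts with the level-sensitive count (a sketch only; real hardware uses a one-cycle delay element and a gate, not a software loop):

```python
def level_and_edge_counts(condition_by_cycle):
    """Count cycles in which a condition holds (E=0) and the number of
    0->1 transitions of that condition (E=1).

    Dividing the level count by the edge count gives the average burst
    length, e.g. the average length of a stall."""
    level, edges = 0, 0
    prev = False
    for met in condition_by_cycle:
        if met:
            level += 1
            if not prev:
                edges += 1   # rising edge: a new burst begins
        prev = met
    return level, edges
```

For example, two stall bursts of lengths 3 and 1 give counts (4, 2): an average stall of 2 cycles.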

Certain events are themselves sent to the performance counters as bitmasks, e.g. the utilization of the execution unit ports - on the original P6 {ALU0/MUL/DIV/FMUL/FADD, ALU1/LEA, LD, STA, STD}, fancier on modern machines.  By controlling the UMASK field of the filter control logic for each performance counter, you could specify counting all instruction dispatches, or just loads, and so on.  Changing the UMASK field allowed you to profile to find out which parts of the machine were being used and which were not.  (This proved successful enough to get me, and the software guy who eventually started using it, an achievement award.)

If I were to do it over again I would have a generic POPCNT as part of the filter logic, as well as the comparison.

Finally, simple filter stuff:

INT - Performance interrupt enable

PC - pin control - this predated me: it toggled an external pin when the performance counter overflowed.  The very first EMON event sampling took that external pin and wired it back to the NMI pin of the CPU.  Obviously, it is better to have internal logic for performance monitoring interrupts. Nevertheless, there is still a need for externally visible performance event signaling, e.g. to freeze performance events outside the CPU, in the fabric, or in I/O devices.  Exactly what those are is obviously implementation dependent, but it is still good to have a standard way of controlling such implementation dependent features. I call this the "pin architecture", and IMHO maintaining such system compatibility was as much a factor in Intel's success as instruction set architecture.

---+ Performance counter freeze

There are always several performance counters - at least two per privilege level.  At least a pair, so that you can compute things like cache miss rates and other ratios.  But they should not necessarily be dedicated to any specific privilege level, because that would be wasteful: you can study things a hell of a lot more quickly if you can use all of the performance counters when other modes are not using them.

When you have several performance counters, you are often measuring things together. You therefore need the ability to freeze them all at the same time. This means that you need to have all of the enable bits for all of the counters, or at least a subset, in the same CSR.  If you want, the enable bit can be in both the per-counter control registers and in a central CSR - i.e. there can be multiple views of the same bit in different CSRs.
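A sketch of the shared-enable idea (register layout and all names invented for illustration):

```python
class CounterBank:
    """Counter bank whose per-counter enable bits are mirrored in one
    central control register, so a single write freezes everything at once."""

    def __init__(self, n):
        self.counts = [0] * n
        self.ctrl = (1 << n) - 1          # one enable bit per counter, all on

    def tick(self, increments):
        # increments[i]: filtered event increment for counter i this cycle
        for i, inc in enumerate(increments):
            if (self.ctrl >> i) & 1:
                self.counts[i] += inc

    def freeze_all(self):
        self.ctrl = 0                     # one CSR write stops every counter
```

The point of the central register is atomicity: software stops all the related counters in one write, so ratios computed from them describe the same interval.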

Performance analysts would really like the ability to freeze multiple CPUs' performance counters at the same time. This is one motivation for that pin control signal.

---+ Precise performance monitoring inputs - You can only wish!

When you are doing performance counter event interrupt based sampling, it would be really nice if the interrupt occurred exactly at the instruction that had the event.

If you can do that, great. However, it takes extra hardware to do that. Also, some events do not in any way correspond to a retired instruction - think events that occur on speculative instructions that never graduate/retire. Again, you can create special registers that record, say, the program counter of such a speculative instruction, but that is again extra hardware.

IMHO there is zero chance that all implementations, particularly the lowest cost implementations, will make all performance events precise.

At the very least there should be a way of discovering whether an event is precise or not.   

Machine check architectures have the same issue.


---+ What should the priority of the hardware performance monitoring interrupts be?

One of the best things I did for Intel was punt on this issue: because I was also the architect in charge of the APIC interrupt controller, I provided a special LVT interrupt register just for the performance monitoring interrupt.

This allowed the performance monitoring interrupt to use all of the APIC features - all of those that made sense. For example, the performance monitoring interrupt that Linux uses is just a normal interrupt of some priority.  But as I mentioned above, the very first usage of performance monitoring interrupts used NMI, and was therefore able to profile code that had interrupts blocked.  The interrupt could be directed to SMM, the not-very-good baby virtual machine monitor, allowing performance monitoring to be done independent of the operating system - very nice when you don't have source code for the operating system. And so on. I can't remember, but it is possible that the interrupt could be directed to processors other than the local processor.  However, that would not subsume the externally visible pin control, because the hardware pin can be a lot less expensive and a lot more precise than signaling an inter-processor interrupt.

I used a similar approach for machine check interrupts, which could also be directed to the operating system, NMI, SMM, hypervisor,…

By the way: I think Greg Favor said that x86's performance monitoring interrupts are level sensitive. That is not strictly speaking true: whether they are level sensitive or not is programmed into the APIC local vector table. You can make them level sensitive or edge triggered.

Obviously, however, when there are multiple performance counters sharing the same interrupt, you need to know which counter overflowed. Hence the sticky bits that Greg noticed in the manuals.


---+ Fancier stuff

The above is mostly about getting the most out of simple performance counters: providing filter logic so that you can get the most insight out of a limited number of events; providing enables in a central place so that you can freeze multiple counters at the same time; and allowing the performance counter interrupts to be directed not just to different privilege levels but to different interrupt priorities, including NMI, and possibly also to external hardware.

There's a lot more stuff that can be done to help performance monitoring. Unfortunately, I have always worked at places where I had to reduce the performance monitoring hardware cost as much as possible.  I am sure, however, that many of you are familiar with fancier performance monitoring features, such as:

+ Histogram counting (allowing you to count distributions without making multiple runs)
    => the CMASK comparators allow a very simple form of this, assuming you have enough performance counters. Actual histogram counters can do this more cheaply.

+ Cycle attribution - defining performance events so that you can actually say things like "X% of cycles are spent waiting for memory".

IMHO the single most important "advanced" performance monitoring feature is what I call "longitudinal profiling": AMD Instruction Based Sampling (IBS), DEC ProfileMe, ARM SPE (Statistical Profiling Extension).  The basic idea is to set a bit on some randomly selected instruction packet somewhere high up in the pipeline, e.g. at instruction fetch, and then let that bit flow down the pipeline, sampling things as it goes. E.g. you might sample the cache miss latency or address, or whether it produced a stall in interaction with a different marked instruction. This sort of profiling is quite expensive, e.g. requiring a bit in many places in the pipeline, as well as registers to record the sample data, but it provides a lot of insight: it can give you distributions and averages, and it can tell you which interactions between instructions are causing problems.

However, if RISC-V cannot yet afford to do longitudinal profiling, the performance counter filter logic that I described above is low hanging fruit, much cheaper.




From: Alankao <alankao@...>
Sent: Monday, July 20, 2020 5:43PM
To: Tech-Privileged <tech-privileged@...>
Subject: Re: [RISC-V] [tech-privileged] A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

 

On 7/20/2020 5:43 PM, alankao wrote:
It was just so poorly rendered in my mail client, so please forgive my spam.

Hi Brian,

> I have been working on a similar proposal myself, with overflow, interrupts, masking, and delegation. One of the key differences in my proposal is that it unifies
> each counter's configuration control into a per-counter register, by using mhpmevent* but with some fields reserved/assigned a meaning.  <elaborating>

Thanks for sharing your experience and the elaboration. The overloading-hpmevent idea looks like the one in the SBI PMU extension thread in the Unix Platform Spec TG by Greg. I have a bunch of questions.  How did your proposal go? Was it discussed in public? Did you manage to implement your idea into a working HW/S-mode SW/U-mode SW solution? If so, we can compare with each other by actually benchmarking the LoC of the perf patch (assuming you do it on Linux) and the system overhead of running a long perf sample.

> Another potential discussion point is, does overflow happen at 0x7fffffffffffffff -> 0x8000000000000000, or at 0xffffffffffffffff -> 0x0000000000000000? I have a
> bias towards the former so that even after overflow, the count is wholly contained in an XLEN-wide register treated as an unsigned number and accessible via
> a single read, which makes arithmetic convenient, but I know some people prefer to (or are used to?) have the overflow bit as a 33rd or 65th bit in a different
> register.

I have no bias here as long as the HPM interrupt can be triggered. But it seems to me that you assume the HPM registers are XLEN bits wide, when actually they are not (yet?).  The spec says they should be 64 bits wide, although obviously nobody implements, nor remembers, that.
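The two overflow conventions Brian describes can be sketched side by side (a Python model with an illustrative small width; the helper names are invented):

```python
def msb_overflow(count, inc, xlen=64):
    """Convention A: overflow flagged at 0x7ff..f -> 0x800..0.  The count
    stays wholly contained, unsigned, in one XLEN-wide register."""
    new = (count + inc) % (1 << xlen)
    half = 1 << (xlen - 1)
    return new, count < half <= new       # crossed into the top half this step

def wrap_overflow(count, inc, xlen=64):
    """Convention B: overflow flagged at 0xff..f -> 0.  The carried-out bit
    must then live somewhere else, e.g. a sticky bit in another register."""
    total = count + inc
    return total % (1 << xlen), total >> xlen != 0
```

Convention A preserves single-read unsigned arithmetic after overflow; convention B loses the carry unless a second register or sticky bit records it, which is the trade-off being debated above.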

> Lastly, a feature I have enjoyed using in the past (on another ISA) is the concept of a 'marked' bit in the mstatus register. ... This is of course a bit intrusive in
> the architecture, as it requires adding a bit to mstatus, but the rest of the kernel just needs to save and restore this bit on context switches, without knowing its
> purpose.

Which architecture/OS are you referring to here? 

Through this discussion, we will understand which idea the community prefers: adding CSRs, overloading existing hpmevents, or some balanced compromise.  I believe the ultimate goal of this thread should be determining what the RISC-V HPM should really be like.

Best,
Alan



Andy Glew Si5
 

You seem to be missing the whole point: the x86 PerfMon/EMON event filtering is generic.
For us, the low 8 bits select between one of 256 "types" of events
Ditto for Intel: in fact, x86 PERFEVTSEL has precisely an 8-bit field to select the type of event. That may have increased, although it still seems to be only eight bits in the current manuals that I just downloaded.


the upper 8 bits provide pretty flexible event-specific filtering
Similarly, Intel PERFEVTSEL has an 8-bit UMASK for event-specific filtering.
 

However, there are further fields that define generic event filtering: filters that you get for free, without having to design them on a per-event-type basis.


(If you care about implementation: the UMASK field stands for "Unit Mask" and is propagated to whatever hardware unit is actually performing the measurement.  The other filter bits live at the performance counter, and therefore apply to all events.)




The CMASK comparison  is applicable to, and relevant to, any event  that  can increment by more than one per clock cycle.

The E edge trigger is relevant to, and applicable to,  any event that  occurs in bursts of back-to-back events  in adjacent clock cycles.

That has turned out to very nicely support a very large variety of events in a very manageable way hardware-wise - in contrast to having many hundreds (or more) individual events.
Ditto with Intel.  A small number of events, filtered and transformed in several different ways.


The main difference is that all of your filtering is event specific, and therefore you can't write portable code that takes advantage of it, whereas most of the Intel filtering is generic, so you can write portable code that takes advantage of it.


There is, of course, some event-specific filtering.  I also observed patterns in that event-specific filtering that I think could quite usefully be standardized - like the part about masking different port combinations.  However, that is not quite as generic as a comparison threshold that applies to every event that can increment by more than one in a clock cycle.

The "push-out" profiling feature could also be generic, counting the length of intervals in which no event occurs, for any event.  I did not do this in P6 because push-out profiling requires an extra counter, even if it's only a few bits, like six.

Similarly, you could tweak the edge detect filter to smooth over a few clock cycles.  Again, that would be generic.
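The push-out idea is easy to model. This is a sketch of the concept only; hardware would use a small side counter sampled into the main counter, not a list:

```python
def pushout_lengths(event_by_cycle):
    """Record the length of every interval during which the selected event
    does not occur, e.g. runs of cycles with no retirement.

    Averaging the result gives the mean 'push-out' described above; with
    histogram support you could get the whole distribution."""
    lengths, run = [], 0
    for ev in event_by_cycle:
        if ev == 0:
            run += 1                  # condition holds: side counter runs
        elif run:
            lengths.append(run)       # condition ended: record the interval
            run = 0
    if run:
        lengths.append(run)           # interval still open at end of trace
    return lengths
```

Because the recorded quantity is an interval length rather than an event count, this is exactly the feature that helps with scarce events, which the CMASK threshold alone cannot characterize.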




















From: Greg Favor <gfavor@...>
Sent: Tuesday, July 21, 2020 6:16PM
To: Andy Glew <andy.glew@...>
Cc: Alankao <alankao@...>, Tech-Privileged <tech-privileged@...>
Subject: Re: [RISC-V] [tech-privileged] A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

 

On 7/21/2020 6:16 PM, Greg Favor wrote:
A general comment about all the fancy forms of event filtering that can be nice to have:

The most basic one of general applicability is mode-specific filtering.  Past that one could try to define some further general filtering capabilities that aren't too specialized, but one quickly gets into having interesting filtering features specific to an event or class of events.

We take the view (in our design) that the lower 16 bits of mhpmevent are used for event selection in the broad sense of the word.  For us, the low 8 bits select between one of 256 "types" of events and then the upper 8 bits provide pretty flexible event-specific filtering.  That has turned out to very nicely support a very large variety of events in a very manageable way hardware-wise - in contrast to having many hundreds (or more) individual events.  But that is just our own implementation of event selection.

My point is that someone else can do similar or not so similar things in their own design with whatever types of general or event-specific filtering features that they might desire.  Trying to standardize that stuff can be tricky to say the least.

For now, at least, I think we should just let people decide what events and event filtering they feel are valuable in their designs.  We should only try to standardize any filtering features that are broadly applicable and valuable to have.  For myself, that results in just proposing mode-specific event filtering.

Greg


On Tue, Jul 21, 2020 at 3:53 PM Andy Glew Si5 <andy.glew@...> wrote:
I have NOT been working on a RISC-V performance monitoring proposal, but I've been involved with performance monitoring first as a user then is an architect for many years and at several companies.

I would like to draw this group's attention to some features of Intel x86 performance monitoring  that turned out pretty successful.

First, you're already talking about hardware performance monitoring interrupts for statistical profiling. Good. A few comments on that below.

But I think one of the best bang-for-buck features of x86 EMON performance monitoring is the performance counter event filtering.

---+ Performance event filtering and transformation logic before counting

Next, I think one of the best bang-for-buck features of x86 EMON performance monitoring is the performance counter event filtering. RISC-V has only the most primitive version of this.

Every x86 performance counter has per counter event select logic.

In addition to that logic, there is a mask that specifies what modes to count in - User, OS, hypervisor.  I see that some of the RISC-V proposals also have that. Good.

But more filtering is also provided:  

Each counter has a "Counter Mask" (CMASK) - really, a threshold. When non-zero, this is compared to the count of the selected event in any given cycle. If >= CMASK, the counter is incremented by 1; if less, there is no increment.

=> This comparison allows a HW event to be used to profile things like "number of cycles in which 1, 2 or more, 3 or more ... events happened" - e.g. how often you are able to achieve superscalar execution.  In vectors, it might count how many vector elements are masked in or out.   If you have events that correspond to buffer occupancy, you can profile to see where the buffer is full or not.

INV - a bit that allows the CMASK comparison to be inverted.

=> so that you can count event > threshold, and event < threshold.
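To make the CMASK/INV mechanics concrete, here is a behavioral sketch in Python (the names and structure are illustrative, not Intel's actual register layout):

```python
def filtered_increment(raw_count, cmask, inv=False):
    """Model one cycle of per-counter CMASK/INV filtering.

    raw_count: number of times the selected event fired this cycle.
    cmask:     threshold; 0 disables filtering (count raw events).
    inv:       invert the comparison (count cycles *below* threshold).
    Returns the amount added to the performance counter this cycle.
    """
    if cmask == 0:
        return raw_count           # no filtering: plain event counting
    met = raw_count >= cmask
    if inv:
        met = not met
    return 1 if met else 0         # threshold mode counts cycles, not events

# "cycles with 2 or more retirements" vs. "cycles with fewer than 2"
per_cycle = [0, 3, 2, 1, 0, 4]
ge2 = sum(filtered_increment(n, cmask=2) for n in per_cycle)            # 3
lt2 = sum(filtered_increment(n, cmask=2, inv=True) for n in per_cycle)  # 3
```

Note how the same raw event ("instructions retired this cycle") yields several derived events purely through the filter settings.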

I would really have liked to have >, <, and == threshold comparisons.  And also the ability to increment by 1 when exceeding the threshold, or by the actual count that exceeds the threshold. The former allows you to find where you are getting good superscalar behavior or not; the latter allows you to determine the average when exceeding the threshold or not. When I defined this I had to save hardware.

This masked comparison allows you to get more different types of events, for events that occur more than once per cycle. That's pretty good, but it doesn't help you with scarce events, events that only occur once every four or eight or N cycles.  Later, Intel added what I call "push-out" profiling: when the comparison condition is met, e.g. when no instruction retires, a counter that increments by one every clock cycle starts; when the condition changes, the value of that counter is what is recorded, naturally subject to all of the usual filtering.  That was too much hardware for me to add in 1991, but it proved very useful.
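A sketch of the push-out idea as described above (purely behavioral; the function name is mine):

```python
def push_out_samples(condition_per_cycle):
    """Record the length of each run of cycles in which the filter
    condition held, e.g. condition = "no instruction retired this cycle".
    Each completed run's length is what push-out profiling samples."""
    samples, run = [], 0
    for met in condition_per_cycle:
        if met:
            run += 1               # condition still holding: keep counting
        elif run:
            samples.append(run)    # condition ended: push out the sample
            run = 0
    if run:
        samples.append(run)        # trailing run at end of trace
    return samples

# two retirement stalls: one 3 cycles long, one 2 cycles long
stalls = push_out_samples([False, True, True, True, False, True, True])
# stalls == [3, 2]
```

In real hardware the recorded value would feed the sampling machinery (and its interrupt) rather than a list, but the arithmetic is the same.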

My first instinct is always to minimize hardware cost for performance monitoring hardware.   The nice thing about the filtering logic at the performance counters is that it removed the hardware cost from the individual units, like the cache,
and left it in the performance monitoring unit.  (The question of centralized versus decentralized performance counters is always an issue.  Suffice it to say that Intel P6 opted for centralized performance counters, to save hardware;
Pentium 4 went to a fully distributed performance counter architecture, but users hated it, so Intel eventually returned to the centralized model, at least architecturally, although microarchitecture implementations might be decentralized.)

More filtering logic: each counter has an E (edge select) bit.  This counts when the condition described by the privilege level mask and CMASK comparison changes.    Using such edge filtering, you can determine the average length of bursts, e.g. the average length of a period during which you have not been able to execute any instructions, and so on.   Simple filters can give you average lengths and occupancies; fancier and more expensive stuff is necessary to actually determine a distribution.
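The edge-select trick can be sketched the same way: count rising edges of the filtered condition in one counter and condition-cycles in another, and the ratio is the average burst length (an illustrative model, not real hardware):

```python
def count_edges(condition_per_cycle):
    """Model of the E (edge select) bit: count cycles where the filtered
    condition changes from not-met to met (a rising edge)."""
    edges, prev = 0, False
    for met in condition_per_cycle:
        if met and not prev:
            edges += 1
        prev = met
    return edges

cond = [False, True, True, False, True, False]
bursts = count_edges(cond)        # 2 bursts of the condition
busy   = sum(cond)                # 3 cycles where the condition held
avg_burst_len = busy / bursts     # 1.5 cycles per burst, from two counters
```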

Certain events themselves are sent to the performance counters as bitmasks.  E.g. the utilization of the execution unit ports as a bitmask - on the original P6 { ALU0/MUL/DIV/FMUL/FADD, ALU1/LEA, LD, STA, STD }, fancier on modern machines.  By controlling the UMASK field of the filter control logic for each performance counter, you could specify to count all instruction dispatches, or just loads, and so on.   Changing the UMASK field allowed you to profile to find out which parts of the machine were being used and which were not.   (This proved successful enough to get me and the software guy who eventually started using it an achievement award.)

If I were to do it over again I would have a generic POPCNT as part of the filter logic, as well as the comparison.
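Here is a sketch of UMASK filtering on a bitmask event, with the wished-for generic POPCNT option included (the port assignments and names are illustrative, not the actual P6 encoding):

```python
def umask_increment(event_bits, umask, popcount=False):
    """Model UMASK filtering on a bitmask-valued event.

    event_bits: this cycle's event bitmask, e.g. one bit per execution
                port that dispatched something.
    umask:      which bits this counter cares about.
    popcount:   if True, add the number of selected bits set (the generic
                POPCNT the author wishes the filter logic had); if False,
                add 1 when any selected bit is set.
    """
    hits = event_bits & umask
    if popcount:
        return bin(hits).count("1")   # count every selected port used
    return 1 if hits else 0           # count cycles with any selected port

# hypothetical port map: bit0=ALU0, bit1=ALU1, bit2=LD, bit3=STA, bit4=STD
LOAD_PORT = 1 << 2
assert umask_increment(0b00101, LOAD_PORT) == 1   # a load dispatched
assert umask_increment(0b00011, LOAD_PORT) == 0   # ALUs only, no load
assert umask_increment(0b10111, 0b11111, popcount=True) == 4
```

The same counter hardware thus yields "all dispatches", "just loads", or a true utilization count, depending only on the filter programming.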

Finally, simple filter stuff:

INT - Performance interrupt enable

PC - pin control - this predated me: it toggled an external pin when the performance counter overflowed.  The very first EMON event sampling took that external pin and wired it back to the NMI pin of the CPU.  Obviously, it is better to have internal logic for performance monitoring interrupts. Nevertheless, there is still a need for externally visible performance event sampling, e.g. freeze performance events outside the CPU, in the fabric, or in I/O devices.  Exactly what those are is obviously implementation dependent, but it's still good to have a standard way of controlling such implementation dependent features. I call this the "pin architecture", and IMHO maintaining such system compatibility was as much a factor in Intel's success as instruction set architecture.

---+ Performance counter freeze

There are always several performance counters - at least two per privilege level. At least a pair, so you can compute things like cache miss rates and other ratios.     But not necessarily dedicated to any specific privilege level, because that would be wasteful: you can study things a hell of a lot more quickly if you can use all of the performance counters when other modes are not using them.

When you have several performance counters, you are often measuring things together. You therefore need the ability to freeze them all at the same time. This means that you need to have all of the enable bits for all of the counters, or at least a subset, in the same CSR.   If you want, the enable bit can be in both the per-counter control registers and in a central CSR - i.e. there can be multiple views of the same bit in different CSRs.
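A toy model of the centralized-enable point: when all of the enable bits live in one control word, a single write freezes every counter together (names are mine, not from any spec):

```python
class CounterBank:
    """Sketch of a counter bank with a central enable CSR: one control
    word holds an enable bit per counter, so a single CSR write starts
    or freezes them all atomically."""

    def __init__(self, n):
        self.counts = [0] * n
        self.enables = (1 << n) - 1   # all counters running

    def tick(self, events):
        """Advance one cycle; events[i] is counter i's increment."""
        for i, e in enumerate(events):
            if self.enables & (1 << i):
                self.counts[i] += e

    def freeze_all(self):
        self.enables = 0              # one write stops every counter

bank = CounterBank(2)
bank.tick([1, 1])
bank.freeze_all()
bank.tick([1, 1])                     # ignored: counters are frozen
# bank.counts == [1, 1] - both stopped on the same cycle
```

If the enables were scattered across per-counter CSRs only, the counters would stop on different cycles and ratios between them would be skewed.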

Performance analysts would really like the ability to freeze multiple CPUs' performance counters at the same time. This is one motivation for that pin control signal.

---+ Precise performance monitoring inputs - You can only wish!

When you are doing performance counter event interrupt based sampling, it would be really nice if the interrupt occurred exactly at the instruction that had the event.

If you can do that, great. However, it takes extra hardware to do that. Also, some events do not in any way correspond to a retired instruction - think events that occur on speculative instructions that never graduate/retire. Again, you can create special registers that record, say, the program counter of such a speculative instruction, but that is again extra hardware.

IMHO there is zero chance that all implementations, particularly the lowest cost implementations, will make all performance events precise.

At the very least there should be a way of discovering whether an event is precise or not.   

Machine check architectures have the same issue.


---+ What should the priority of the hardware performance monitoring interrupts be?

One of the best things I did for Intel was punt on this issue: because I was also the architect in charge of the APIC interrupt controller, I provided a special LVT interrupt register just for the performance monitoring interrupt.

This allowed the performance monitoring interrupt to use all of the APIC features - all of those that made sense. For example, the performance monitoring interrupt that Linux uses is just a normal interrupt of some priority.  But as I mentioned above, the very first usage of performance monitoring interrupts used NMI, and was therefore able to profile code that had interrupts blocked.   The interrupt could be directed to SMM, the not very good baby virtual machine monitor, allowing performance monitoring to be done independent of the operating system - very nice when you don't have source code for the operating system. And so on. I can't remember, but it is possible that the interrupt could be directed to processors other than the local processor.  However, that would not subsume the externally visible pin control, because the hardware pin can be a lot less expensive and a lot more precise than signaling an inter-processor interrupt.

I used a similar approach for machine check interrupts, which could also be directed to the operating system, NMI, SMM, hypervisor,…

By the way: I think Greg Favor said that x86's performance monitoring interrupts are level sensitive. That is not strictly speaking true: whether they are level sensitive or not is programmed into the APIC local vector table. You can make them level sensitive or edge triggered.

Obviously, however, when there are multiple performance counters sharing the same interrupt, you need to know which counter overflowed. Hence the sticky bits that Greg noticed in the manuals.


---+ Fancier stuff

The above is mostly about getting the most out of simple performance counters:  providing filter logic so that you can get the most insight out of the limited number of events;
providing enables in a central place so that you can freeze multiple counters at the same time;  allowing the performance counter interrupts to be directed not just at different privilege levels but at different interrupt priorities, including NMI, and possibly also to external hardware.

There's a lot more stuff that can be done to help performance monitoring. Unfortunately, I have always worked at a place where I had to reduce  the performance monitoring hardware cost as much as possible.   I am sure, however, that many of you are familiar with fancier performance monitoring features, such as

+ Histogram counting (allowing you to count distributions without making multiple runs)
    => the CMASK comparators allow a very simple form of this, assuming you have enough performance counters. Actual histogram counters can do this more cheaply.

+ Cycle attribution - defining performance events so that you can actually say things like  "X% of cycles are spent waiting for memory".

IMHO the single most important "advanced" performance monitoring feature is what I call "longitudinal profiling": AMD Instruction Based Sampling (IBS), DEC ProfileMe, ARM SPE (Statistical Profiling Extension).  The basic idea is to set a bit on some randomly selected instruction packet somewhere high up in the pipeline, e.g. at instruction fetch, and then let that bit flow down the pipeline, sampling things as it goes. E.g. you might sample the cache miss latency or address, or whether it produced a stall in interaction with a different marked instruction. This sort of profiling is quite expensive, e.g. requiring a bit in many places in the pipeline, as well as registers to record the sample data, but it provides a lot of insight: it can give you distributions and averages, and it can tell you which interactions between instructions are causing problems.

However, if RISC-V cannot yet afford to do longitudinal profiling, the performance counter filter logic that I described above is low hanging fruit, much cheaper.




From: Alankao <alankao@...>
Sent: Monday, July 20, 2020 5:43PM
To: Tech-Privileged <tech-privileged@...>
Subject: Re: [RISC-V] [tech-privileged] A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

 

On 7/20/2020 5:43 PM, alankao wrote:
It was just so poorly rendered in my mail client, so please forgive my spam.

Hi Brian,

> I have been working on a similar proposal myself, with overflow, interrupts, masking, and delegation. One of the key differences in my proposal is that it unifies
> each counter's configuration control into a per-counter register, by using mhpmevent* but with some fields reserved/assigned a meaning.  <elaborating>

Thanks for sharing your experience and the elaboration. The overloading-hpmevent idea looks like the one in the SBI PMU extension threads in the Unix Platform Spec TG by Greg. I have a bunch of questions.  How did your proposal fare later? Was it discussed in public? Did you manage to implement your idea into a working HW/S-mode SW/U-mode SW solution? If so, we can compete with each other by benchmarking for real: the LoC of the perf patch (assuming you do it on Linux) and the system overhead of running a long perf sample.

> Another potential discussion point is, does overflow happen at 0x7fffffffffffffff -> 0x8000000000000000, or at 0xffffffffffffffff -> 0x0000000000000000? I have a
> bias towards the former so that even after overflow, the count is wholly contained in an XLEN-wide register treated as an unsigned number and accessible via
> a single read, which makes arithmetic convenient, but I know some people prefer to (or are used to?) have the overflow bit as a 33rd or 65th bit in a different
> register.

I have no bias here as long as the HPM interrupt can be triggered. But somehow it seems to me that you assume the HPM registers are XLEN wide, but actually they are not (yet?).  The spec says they should be 64 bits wide, although obviously nobody implements that, nor remembers it.

> Lastly, a feature I have enjoyed using in the past (on another ISA) is the concept of a 'marked' bit in the mstatus register. ... This is of course a bit intrusive in
> the architecture, as it requires adding a bit to mstatus, but the rest of the kernel just needs to save and restore this bit on context switches, without knowing its
> purpose.

Which architecture/OS are you referring to here? 

Through this discussion, we will understand which idea the community prefers: adding CSRs, overloading existing hpmevents, or some balanced compromise.  I believe the ultimate goal of this thread should be determining what the RISC-V HPM should really be like.

Best,
Alan

I apologize for some of the language errors that occur far too frequently in my email. I use speech recognition much of the time, and far too often do not catch misrecognition errors. This can be quite embarrassing, amusing, and/or confusing. Typical errors are not spelling but homonyms, words that sound the same - e.g. "cash" instead of "cache".


alankao
 

Hi all,

Although there were some unresolved discussions, they have little to do with what we should do for the next step.  I believe Greg's proposal is superior to the original one in the starting thread because

1.  It reuses `hpmevents` for most of the functions that we all agree that RISC-V needs, instead of adding a bunch of new registers.
2.  It is H-ext-aware

I suggest Greg take the lead to start a PR in the ISA Repo, I can help review and evaluate the effort to patch existing software.

Thanks,
Alan


Greg Favor
 

Alan,

I'm fine with taking the lead on this architecture extension.  But it should follow a proper process as directed by the TSC.  Thus far this would mean getting a new TG created or doing something less formally under an existing TG.  But for smaller extension proposals like this there is need for a proper lighter weight and faster process.  Need for this is recognized and I suspect will probably be promulgated by the TSC some time soon.

So I suggest we pause for a short bit, and then see if we can follow that expedited process once it is available.  In the meantime I/we can prepare what we can in advance.  (I don't think this will represent a material slow down to getting to a frozen spec and then to ratification.)

Greg


On Wed, Jul 29, 2020 at 5:24 PM alankao <alankao@...> wrote:
Hi all,

Although there were some unresolved discussions, they have little to do with what we should do for the next step.  I believe Greg's proposal is superior to the original one in the starting thread because

1.  It reuses `hpmevents` for most of the functions that we all agree that RISC-V needs, instead of adding a bunch of new registers.
2.  It is H-ext-aware

I suggest Greg take the lead to start a PR in the ISA Repo, I can help review and evaluate the effort to patch existing software.

Thanks,
Alan


Anup Patel
 

Hi Greg,

 

Few comments on your proposal (https://lists.riscv.org/g/tech-privileged/message/205):

 

1. The BIT[31] is not required because we already have MCOUNTINHIBIT CSR

2. The BIT[28] contradicts CSR number semantics of HPMCOUNTER CSR because currently all HPMCOUNTER CSRs are “User-Read-Only”.

3. We need to align “event_info” definition in SBI PMU Extension to consider your proposed bits in MHPMEVENT CSRs.

 

Regards,

Anup

 



Greg Favor
 

Anup,

Thanks.  Comments below.

Greg

On Mon, Aug 3, 2020 at 9:25 PM Anup Patel <Anup.Patel@...> wrote:

Hi Greg,

 

Few comments on your proposal (https://lists.riscv.org/g/tech-privileged/message/205):

 

1. The BIT[31] is not required because we already have MCOUNTINHIBIT CSR


This is up in the air for inclusion or not in the proposal.  As solely a bit that software can set/clear to start/stop a counter, the argument for having this bit is weak.  Although SBI calls for writing to the mhpmevent CSR for a counter would need some way to recognize when the associated bit in mcountinhibit needs to be set or cleared.  But with this bit in mhpmevent itself, no special support is needed (i.e. the writing of event_info into the upper part of mhpmevent takes care of whatever bits are there).

The argument for this bit in mhpmevent grows when one allows for hardware setting and clearing of the bit.  For example, in response to a cross-trigger from the debug Trigger Module (e.g. to start counting when a certain instruction executed and to stop counting when another address is executed).  Or to start/stop counting in response to another counter overflowing after N occurrences of some event.  Currently cross-trigger capabilities like this aren't standardized but, irrespective of whether they get standardized or not, having a standard Active bit provides the framework for a design to have whatever mechanisms it desires. 


2. The BIT[28] contradicts CSR number semantics of HPMCOUNTER CSR because currently all HPMCOUNTER CSRs are “User-Read-Only”.

3. We need to align “event_info” definition in SBI PMU Extension to consider your proposed bits in MHPMEVENT CSRs.

 

Regards,

Anup

 



Greg Favor
 

Email accidentally sent early.  Let me finish the email and then I'll send it again.

Greg


On Mon, Aug 3, 2020 at 9:41 PM Greg Favor via lists.riscv.org <gfavor=ventanamicro.com@...> wrote:
Anup,

Thanks.  Comments below.

Greg

On Mon, Aug 3, 2020 at 9:25 PM Anup Patel <Anup.Patel@...> wrote:

Hi Greg,

 

Few comments on your proposal (https://lists.riscv.org/g/tech-privileged/message/205):

 

1. The BIT[31] is not required because we already have MCOUNTINHIBIT CSR


This is up in the air for inclusion or not in the proposal.  As solely a bit that software can set/clear to start/stop a counter, the argument for having this bit is weak.  Although SBI calls for writing to the mhpmevent CSR for a counter would need some way to recognize when the associated bit in mcountinhibit needs to be set or cleared.  But with this bit in mhpmevent itself, no special support is needed (i.e. the writing of event_info into the upper part of mhpmevent takes care of whatever bits are there).

The argument for this bit in mhpmevent grows when one allows for hardware setting and clearing of the bit.  For example, in response to a cross-trigger from the debug Trigger Module (e.g. to start counting when a certain instruction executed and to stop counting when another address is executed).  Or to start/stop counting in response to another counter overflowing after N occurrences of some event.  Currently cross-trigger capabilities like this aren't standardized but, irrespective of whether they get standardized or not, having a standard Active bit provides the framework for a design to have whatever mechanisms it desires. 


2. The BIT[28] contradicts CSR number semantics of HPMCOUNTER CSR because currently all HPMCOUNTER CSRs are “User-Read-Only”.

3. We need to align “event_info” definition in SBI PMU Extension to consider your proposed bits in MHPMEVENT CSRs.

 

Regards,

Anup

 



Greg Favor
 

Anup,

Thanks.  Comments below.

Greg

On Mon, Aug 3, 2020 at 9:25 PM Anup Patel <Anup.Patel@...> wrote:

Hi Greg,

 

Few comments on your proposal (https://lists.riscv.org/g/tech-privileged/message/205):

 

1. The BIT[31] is not required because we already have MCOUNTINHIBIT CSR


This is up in the air for inclusion or not in the proposal.  As solely a bit that software can set/clear to start/stop a counter, the argument for having this bit is weak.  Although SBI calls for writing to the mhpmevent CSR for a counter would need some way to recognize when the associated bit in mcountinhibit needs to be set or cleared.  But with this Active bit in mhpmevent itself, no special support is needed (i.e. the writing of event_info into the upper part of mhpmevent takes care of whatever bits are there).

The argument for this bit in mhpmevent grows when one allows for hardware setting and clearing of the bit.  For example, in response to a cross-trigger from the debug Trigger Module, e.g. to start counting when a certain instruction executed and to stop counting when another address is executed.  Or to start/stop counting in response to another counter overflowing after N occurrences of some event.  In essence, for counting more complex types of event conditions, particularly in debug scenarios and less so in straight perf mon scenarios.

Currently cross-trigger capabilities like these aren't standardized but, irrespective of whether they get standardized or not, having a standard Active bit provides the framework for a design to have whatever mechanisms it desires.  And note that hardware manipulation of mcountinhibit bits would be a change to the architectural definition of mcountinhibit.  This isn't a forcing issue, but having this Active bit in mhpmevent sidesteps that issue.

But even with all this, it is still up in the air whether people want or don't want to standardize this separate counter control bit as part of a counter extension.  We'll see where people fall on this.
 

2. The BIT[28] contradicts CSR number semantics of HPMCOUNTER CSR because currently all HPMCOUNTER CSRs are “User-Read-Only”.


Good point.  To support this feature (which some others have also been requesting) will require defining an alias CSR for each hpmcounter CSR that is "User-Read-Write".

Having two User aliases of the same CSR is conceptually not pretty, but this is simple and seems like a necessary evil for supporting this feature.

Like above, we'll have to see if the interest in this feature is significant enough to warrant adding read/write hpmcounter aliases.
 

3. We need to align “event_info” definition in SBI PMU Extension to consider your proposed bits in MHPMEVENT CSRs.


In my mind event_info simply fills in all the higher bits of mhpmevent that are not written by event_idx - which I believe was to be the default code path in the SBI PMU code.  (This, of course, applies for future implementations that choose to organize their mhpmevent registers in this simple manner.  Implementations are free to organize their mhpmevent CSR differently and supply corresponding implementation-specific SBI code.)  In other words (for RV64):

mhpmevent[19:16] = event_idx.type
mhpmevent[15:  0] = event_idx.code
mhpmevent[63:20] = event_info[43:0]
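For what it's worth, that RV64 layout can be captured as a small packing helper (a sketch of the field placement described above; the function name is mine):

```python
def pack_mhpmevent(event_type, event_code, event_info):
    """Pack an mhpmevent value per the RV64 layout in the message:
    bits [19:16] = event_idx.type, [15:0] = event_idx.code,
    [63:20] = event_info[43:0]."""
    assert 0 <= event_type < (1 << 4)     # 4-bit type field
    assert 0 <= event_code < (1 << 16)    # 16-bit code field
    assert 0 <= event_info < (1 << 44)    # 44 bits of event_info
    return (event_info << 20) | (event_type << 16) | event_code

v = pack_mhpmevent(0x3, 0x0042, 0x1)
# (v >> 16) & 0xF == 0x3, v & 0xFFFF == 0x42, v >> 20 == 0x1
```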

Greg
 

 

Regards,

Anup



Anup Patel
 

Hi Greg,

 

We have SBI_PMU_COUNTER_START/STOP calls where the SBI implementation will update the MCOUNTINHIBIT bits. The SBI_PMU_COUNTER_START call also takes a parameter for the initial value of the counter, so we don’t need the HPMCOUNTER CSRs to be writeable in S-mode, and this also avoids alias CSRs.

 

I think we can remove BIT[28] in your proposal.

 

Regards,

Anup

 

From: tech-privileged@... <tech-privileged@...> On Behalf Of Greg Favor
Sent: 04 August 2020 10:42
To: Anup Patel <Anup.Patel@...>
Cc: alankao <alankao@...>; tech-privileged@...
Subject: Re: [RISC-V] [tech-privileged] A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

 

Anup,

 

Thanks.  Comments below.

 

Greg

 

On Mon, Aug 3, 2020 at 9:25 PM Anup Patel <Anup.Patel@...> wrote:

Hi Greg,

 

Few comments on your proposal (https://lists.riscv.org/g/tech-privileged/message/205):

 

1. The BIT[31] is not required because we already have MCOUNTINHIBIT CSR

 

This is up in the air for inclusion or not in the proposal.  As solely a bit that software can set/clear to start/stop a counter, the argument for having this bit is weak.  Although SBI calls for writing to the mhpmevent CSR for a counter would need some way to recognize when the associated bit in mcountinhibit needs to be set or cleared.  But with this Active bit in mhpmevent itself, no special support is needed (i.e. the writing of event_info into the upper part of mhpmevent takes care of whatever all bits are there).

 

The argument for this bit in mhpmevent grows when one allows for hardware setting and clearing of the bit.  For example, in response to a cross-trigger from the debug Trigger Module, e.g. to start counting when a certain instruction executed and to stop counting when another address is executed.  Or to start/stop counting in response to another counter overflowing after N occurrences of some event.  In essence, for counting more complex types of event conditions, particularly in debug scenarios and less so in straight perf mon scenarios.

 

Currently cross-trigger capabilities like these aren't standardized but, irrespective of whether they get standardized or not, having a standard Active bit provides the framework for a design to have whatever mechanisms it desires.  And note that hardware manipulation of mcountinhibit bits would be a change to the architectural definition of mcountinhibit.  This isn't a forcing issue, but having this Active bit in mhpmevent sidesteps that issue.

 

But even with all this, it is still up in the air whether people want or don't want to standardize this separate counter control bit as part of a counter extension.  We'll see where people fall on this.

 

2. The BIT[28] contradicts CSR number semantics of HPMCOUNTER CSR because currently all HPMCOUNTER CSRs are “User-Read-Only”.

 

Good point.  To support this feature (which some others have also been requesting) will require defining an alias CSR for each hpmcounter CSR that is "User-Read-Write".

 

Having two User aliases of the same CSR is conceptually not pretty, but this is simple and seems like a necessary evil for supporting this feature.

 

Like above, we'll have to see if the interest in this feature is significant enough to warrant adding read/write hpmcounter aliases.

 

3. We need to align the “event_info” definition in the SBI PMU Extension with your proposed bits in the MHPMEVENT CSRs.

 

In my mind event_info simply fills in all the higher bits of mhpmevent that are not written by event_idx - which I believe was to be the default code path in the SBI PMU code.  (This, of course, applies for future implementations that choose to organize their mhpmevent registers in this simple manner.  Implementations are free to organize their mhpmevent CSR differently and supply corresponding implementation-specific SBI code.)  In other words (for RV64):

 

mhpmevent[19:16] = event_idx.type

mhpmevent[15:0]  = event_idx.code

mhpmevent[63:20] = event_info[43:0]
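In C, that default composition could be sketched as follows.  The helper name is hypothetical; only the bit layout comes from the field assignments above.

```c
#include <stdint.h>

/* Hypothetical sketch of the default SBI PMU code path described above:
 * pack event_idx (type + code) and event_info into one RV64 mhpmevent
 * value, following the field layout mhpmevent[19:16] = event_idx.type,
 * mhpmevent[15:0] = event_idx.code, mhpmevent[63:20] = event_info[43:0]. */
static uint64_t compose_mhpmevent(uint32_t event_idx, uint64_t event_info)
{
    uint64_t type = (uint64_t)(event_idx >> 16) & 0xF;   /* -> mhpmevent[19:16] */
    uint64_t code = event_idx & 0xFFFF;                  /* -> mhpmevent[15:0]  */
    uint64_t info = event_info & ((1ULL << 44) - 1);     /* -> mhpmevent[63:20] */
    return (info << 20) | (type << 16) | code;
}
```

Implementations that organize mhpmevent differently would of course substitute their own packing here.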

 

Greg

 

 

Regards,

Anup

 


Anup Patel
 

Hi Greg,

 

No issues with Bit[31] of your proposed MHPMEVENT definition. The SBI_PMU_COUNTER_START/STOP calls can either update the MCOUNTINHIBIT bits or update Bit[31] of the appropriate MHPMEVENT CSR.

 

Regarding Bit[28], I agree with you. Let’s wait for more comments.

 

Regards,

Anup

 

From: Greg Favor <gfavor@...>
Sent: 04 August 2020 11:40
To: Anup Patel <Anup.Patel@...>
Cc: alankao <alankao@...>; tech-privileged@...
Subject: Re: [RISC-V] [tech-privileged] A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

 

On Mon, Aug 3, 2020 at 10:57 PM Anup Patel <Anup.Patel@...> wrote:

Hi Greg,

 

We have SBI_PMU_COUNTER_START/STOP calls where the SBI implementation will update the MCOUNTINHIBIT bits.

 

That reduces the argument for bit [31].  I won't remove it yet (until I write up an updated proposal), but I imagine that bit will be dropped if no one else speaks up in support of it.  (Although if/when someone (such as us) supports hardware events starting and stopping counters, then we'll have to deal with the fact that this is a change to the current arch definition of the mcountinhibit CSR.)

 

The SBI_PMU_COUNTER_START call also takes a parameter for the initial counter value, so we don’t need the HPMCOUNTER CSRs to be writeable in S-mode; this also avoids the alias CSRs.

 

I think we can remove BIT[28] in your proposal.

 

I'm OK with this.  It is other people that have been more desirous of this feature.  I agree that it would be cleaner and simpler to not have this bit, but let's see who speaks up for keeping this feature.

 

Greg

 

 

Regards,

Anup

 

From: tech-privileged@... <tech-privileged@...> On Behalf Of Greg Favor
Sent: 04 August 2020 10:42
To: Anup Patel <Anup.Patel@...>
Cc: alankao <alankao@...>; tech-privileged@...
Subject: Re: [RISC-V] [tech-privileged] A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

 

Anup,

 

Thanks.  Comments below.

 

Greg

 

On Mon, Aug 3, 2020 at 9:25 PM Anup Patel <Anup.Patel@...> wrote:

Hi Greg,

 

Few comments on your proposal (https://lists.riscv.org/g/tech-privileged/message/205):

 

1. The BIT[31] is not required because we already have MCOUNTINHIBIT CSR

 


 


alankao
 

Hi all,

I am for bit[28].  I tried to defend the idea, but unfortunately my reply went privately to Anup only, and the system doesn't keep any backup.  Please wait for my comment once Anup replies.


Anup Patel
 

The most desired feature of a PMU is counting events in the right context (or the right mode). This is not clearly defined in the RISC-V spec right now. Greg’s proposal already addresses this in a clean way by defining the required bits in the MHPMEVENT CSRs.

 

The other important feature expected from a PMU is reading counters without traps; this is already available because the HPMCOUNTER CSRs are “User-Read-Only”.

 

Regarding HPMCOUNTER writes from S-mode, the Linux PMU drivers (across architectures) only write a PMU counter at the time of configuring the counter. We anyway have an SBI call to configure a RISC-V PMU counter because the MHPMEVENT CSR is M-mode only, so it is better to initialize/write the MHPMCOUNTER CSR in M-mode at the time of configuring the counter. Allowing S-mode to write the HPMCOUNTER CSRs is good but won’t benefit much. On the contrary, RISC-V hypervisors might end up saving/restoring more CSRs if the HPMCOUNTER CSRs are writeable from S-mode.

 

The code snippet mentioned below requires an “#ifdef”, which means we have to build the Linux RISC-V image differently for doing CSR writes this way. This approach is not going to fly for distros because they can’t release a single Linux RISC-V image for all RISC-V hardware if we have such an “#ifdef”.

 

Regards,

Anup

 

From: alankao <alankao@...>
Sent: 04 August 2020 12:44
To: Anup Patel <Anup.Patel@...>
Subject: Private: Re: [RISC-V] [tech-privileged] A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

 

The bit[28], or the *mcounterwen* bit in my original proposal, is essential for better performance versus M-mode register access through SBI calls or trap-and-emulate. It is definitely helpful for a simple M-S-U RISC-V system.  How many cycles are you willing to give up just to update the M-mode register every time the kernel handles an HPM interrupt or a context switch?  I admit it may not be useful, or may even be harmful, for an M-H-S-U system, but it is obvious that a simple RISC-V machine running a single OS kernel can benefit greatly from it.

I suggest we keep it (either bit[28] or something like our *mcounterwen*) as a WARL field.  In the Linux kernel, we can have an API whose implementation looks like:

void riscv_update_hpm_event(...)
{
#ifdef CONFIG_RESTRICT_MREG_ACCESS
	/* M-mode HPM registers must be updated through an SBI call */
	sbi_update_hpm_event(...);
#else
	/* bit[28] was enabled at initialization, so S-mode can
	   write the HPM CSRs directly */
	csr_write(...);
#endif
}

Any comments? 


alankao
 

Hi Anup,

> The most desired feature from a PMU is counting events in right-context (or right-mode). This is not clearly defined in RISC-V spec right now. Greg’s proposal already address this in a clean way by defining required bits in MHPMEVENT CSRs.  Other important feature expected from a PMU is reading counters without traps, this is already available because HPMCOUNTER CSRs are “User-Read-Only”

 
Claiming some features as "most desired" is too subjective.  I agree that mode-specific counting is important, but for performance monitoring, the HPM interrupt is also essential.  Otherwise, sampling like `perf record` just doesn't work.

> Regarding HPMCOUNTER writes from S-mode, the Linux PMU drivers (across architectures) only writes PMU counter at time of configuring the counter. We anyway have SBI call to configure a RISC-V PMU counter because MHPMEVENT CSR is for M-mode only so it is better to initialize/write the MHPMCOUNTER CSR in M-mode at time of configuring the counter. Allowing S-mode to write HPMCOUNTER CSR is good but won’t benefit much. On the contrary, RISC-V hypervisors might end up save/restore more CSRs if HPMCOUTNER CSR is writeable from S-mode.

You are understating the case when you say "only writes counter at configuration time."  It makes it sound like the kernel seldom writes them.  The fact is, the counters and other registers need to be written every time the corresponding process is context-switched in and every time an HPM interrupt is handled.  Would you like to elaborate: on what basis do you say writing counters from S-mode is good, and on what basis do you judge that it won't benefit much?

However, since no RISC-V hardware has the discussed features yet, I doubt that anyone has quantitative data.  My proposal here is that I will take our existing solution (the AndeStar V5 extension and its perf-event port on Linux 4.17) as the testbed.  By default, we have *mcounterwen*, which is effectively equal to the bit[28] in Greg's proposal, to enable S-mode writes to the HPM CSRs; this is the treatment group.  Then we apply a patch that transforms all existing csr_write's to HPM CSRs (including counters) into SBI calls, as the control group.  My expectation is that the wall-clock time for performing a sampling run in the treatment group will be more than marginally shorter than in the control group.

Meanwhile, I agree with your concern about the H extension.  That's why I emphasized this feature is useful for an M-S-U configuration and questionable for an M-H-S-U one.

> The code snippet mentioned below requires “#ifdef” which means we have to build Linux RISC-V image differently for doing CSR writes this way. This approach is not going to fly for distros because distros can’t release single Linux RISC-V image for all RISC-V hardware if we have such “#ifdef”.

Each distro maintains its own priorities among hundreds of thousands of kernel features, not to mention the many nameless "distributions" released by different teams as their BSPs, which do the same thing.  The diversity of features is the reason that so many distributions rise and fall, compete and cooperate.  Therefore, what we should debate is not what distros that support RISC-V should do with this possible divergence: I am totally fine with CONFIG_RESTRICT_MREG_ACCESS being off by default!  Big ones like Fedora and Debian aim at desktop or server, and that's good.  What we should really debate here is the feature itself: whether it is useful enough for some, not all, possible RISC-V machines, and helps make people's lives easier.

For the record, distributions can just release a single image that disables this feature by default.  That's their choice, because they expect that quite a fraction of installs will run as a guest OS, which should not enable the feature or there will be a lot more work in the hypervisor.  The image can still run on any RISC-V machine, whether or not it supports bit[28]/mcounterwen.  You are not making a valid point here.

Best,
Alan


Anup Patel
 

Hi Alan,

 

I never said the HPM overflow interrupt is not important. The MHPMOVERFLOW CSR proposed by Greg is perfectly fine.

 

I think you missed my point regarding the H-extension. If S-mode is allowed to directly write the HPMCOUNTER CSRs, then for the H-extension we will need additional VSHPMCOUNTER CSRs to allow a hypervisor to context-switch them. We can avoid a lot of these CSRs by keeping the HPMCOUNTER CSRs read-only for S-mode. Initialization/restoration of an HPMCOUNTER value can be done via the SBI_PMU_COUNTER_START call, and this integrates well with the Linux PMU driver framework too. The Linux PMU driver framework only updates the counter value in the “add()” or “start()” callback. That’s why allowing S-mode to write the HPMCOUNTER CSRs won’t provide much benefit.

 

Regarding a single Linux RISC-V image for all platforms, this is a requirement from various distros and the Linux RISC-V maintainers. We should avoid a kernel feature which needs to be explicitly enabled by users and which distros keep disabled by default. The “#ifdef”-based feature check should be replaced by runtime feature checking based on the device tree or something else.
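As a sketch of that runtime alternative: a single kernel image reads a flag once at boot (e.g. from a device-tree property) and then branches at runtime instead of at build time.  All names below are hypothetical stand-ins for the real kernel/SBI interfaces; the CSR and SBI calls are reduced to path markers so the control flow is visible.

```c
#include <stdbool.h>

/* Hypothetical sketch: one kernel image, runtime choice of update path.
 * hpm_csrs_writeable would be set once at boot from a device-tree
 * property advertising bit[28]/mcounterwen support. */
typedef enum { HPM_PATH_CSR, HPM_PATH_SBI } hpm_path;

static bool hpm_csrs_writeable;  /* filled in from the device tree at init */

static hpm_path riscv_update_hpm_event(unsigned long val)
{
    (void)val;
    if (hpm_csrs_writeable)
        return HPM_PATH_CSR;  /* would do: csr_write(..., val); */
    return HPM_PATH_SBI;      /* would do: sbi_update_hpm_event(val); */
}
```

This keeps one image bootable everywhere: hardware without the feature simply always takes the SBI path.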

 

Regards,

Anup

 

From: tech-privileged@... <tech-privileged@...> On Behalf Of alankao
Sent: 05 August 2020 07:47
To: tech-privileged@...
Subject: Re: [RISC-V] [tech-privileged] A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

 
