A proposal to enhance RISC-V HPM (Hardware Performance Monitor)
Andy Glew Si5
I have NOT been working on a
RISC-V performance monitoring proposal, but I've been involved
with performance monitoring, first as a user and then as an architect,
for many years and at several companies.
I would like to draw this group's attention to some features of
Intel x86 performance monitoring that turned out pretty
successful.
First, you're already talking about hardware performance
monitoring interrupts for statistical profiling. Good. A few
comments on that below.
But I think one of the best bang-for-buck performance
monitoring features of x86 EMON performance monitoring is the
performance counter event filtering.
---+ Performance event filtering and transformation logic before counting

RISC-V has only the most primitive version of this. Every x86 performance counter has per-counter event select logic. In addition to that logic, there is a mask that specifies which modes to count in - User, OS, hypervisor. I see that some of the RISC-V proposals also have that. Good. But more filtering is also provided:

Each counter has a "Counter Mask" CMASK - really, a threshold. When non-zero, this is compared to the count of the selected event in any given cycle. If the count is >= CMASK, the counter is incremented by 1; if less, no increment.

=> This comparison allows a HW event to be used to profile things like "number of cycles in which 1, 2 or more, 3 or more ... events happened" - e.g. how often you are able to achieve superscalar execution. In vectors, it might count how many vector elements are masked in or out. If you have events that correspond to buffer occupancy, you can profile to see where the buffer is full or not.

INV - a bit that allows the CMASK comparison to be inverted.

=> so that you can count event >= threshold, and event < threshold. I would really have liked to have >, <, and == threshold, and also the ability to increment by 1 when exceeding the threshold, or by the actual count that exceeds the threshold. The former allows you to find where you are getting good superscalar behavior or not; the latter allows you to determine the average when exceeding the threshold or not. When I defined this I had to save hardware.

This masked comparison allows you to get more different types of events, for events that can occur more than once per cycle. That's pretty good, but it doesn't help you with scarce events, i.e. events that only occur once every four or eight or N cycles. Later, Intel added what I call "push-out" profiling: when the comparison condition is met, e.g. when no instruction retires, a counter that increments by one every clock cycle starts; when the condition changes, the value of that counter is what is recorded, naturally subject to all of the usual filtering. That was too much hardware for me to add in 1991, but it proved very useful.

My first instinct is always to minimize hardware cost for performance monitoring hardware. The nice thing about the filtering logic at the performance counters is that it removes the hardware cost from the individual unit, like the cache, and leaves it in the performance monitoring unit. (The question of centralized versus decentralized performance counters is always an issue. Suffice it to say that Intel P6 had four centralized performance counters, to save hardware; Pentium 4 went to a fully distributed performance counter architecture, but users hated it, so Intel returned to the centralized model, at least architecturally, although microarchitecture implementations might be decentralized.)

More filtering logic: each counter has an E, edge select, bit. This counts when the condition described by the privilege level mask and the CMASK comparison changes. Using such edge filtering, you can determine the average length of bursts, e.g. the average length of a period in which you have not been able to execute any instruction, and so on. Simple filters can give you average lengths and occupancies; fancier and more expensive stuff is necessary to actually determine a distribution.
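To make the counting rule concrete, here is a minimal behavioral sketch of that per-counter filter path (CMASK, INV, and the E edge bit). All names are illustrative; this models the behavior described above, not any vendor's implementation:

    /* Behavioral model of the per-counter filter path described above.
     * All names are illustrative - this is not any vendor's RTL. */
    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        uint8_t cmask;  /* threshold; 0 disables the comparison        */
        bool    inv;    /* INV: invert the >= comparison               */
        bool    edge;   /* E: count only 0 -> 1 condition transitions  */
        bool    prev;   /* condition seen in the previous cycle        */
    } filter_t;

    /* Called once per cycle with the raw event count coming from the
     * monitored unit; returns the amount to add to the counter. */
    uint64_t filter_increment(filter_t *f, uint8_t raw_count)
    {
        if (f->cmask == 0) {            /* no threshold: plain counting */
            f->prev = (raw_count != 0);
            return raw_count;
        }
        bool cond = f->inv ? (raw_count < f->cmask)
                           : (raw_count >= f->cmask);
        bool fire = f->edge ? (cond && !f->prev) : cond;
        f->prev = cond;
        return fire ? 1 : 0;
    }

With cmask == 0 the filter degenerates to plain event counting, which is the default behavior described above.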
Certain events are themselves sent to the performance counters as bitmasks, e.g. the utilization of the execution unit ports - on the original P6 { ALU0/MUL/DIV/FMUL/FADD, ALU1/LEA, LD, STA, STD }, fancier on modern machines. By controlling the UMASK field of the filter control logic for each performance counter, you could specify counting all instruction dispatches, or just loads, and so on. Changing the UMASK field allowed you to profile to find out which parts of the machine were being used and which were not. (This proved successful enough to get me, and the software guy who eventually started using it, an achievement award.) If I were to do it over again, I would have a generic POPCNT as part of the filter logic, as well as the comparison.

Finally, simple filter stuff:

INT - performance interrupt enable.

PC - pin control - this predated me: it toggled an external pin when the performance counter overflowed. The very first EMON event sampling took that external pin and wired it back to the NMI pin of the CPU. Obviously, it is better to have internal logic for performance monitoring interrupts. Nevertheless, there is still a need for externally visible performance event signals, e.g. to freeze performance events outside the CPU, in the fabric, or in I/O devices. Exactly what those are is obviously implementation dependent, but it's still good to have a standard way of controlling such implementation dependent features. I call this the "pin architecture", and IMHO maintaining such system compatibility was as much a factor in Intel's success as instruction set architecture.

---+ Performance counter freeze

There are always several performance counters. At least two per privilege level. At least a pair, so you can compute things like cache miss rates and other ratios. But not necessarily dedicated to any specific privilege level, because that would be wasteful: you can study things a hell of a lot more quickly if you can use all of the performance counters when other modes are not using them.

When you have several performance counters, you are often measuring things together. You therefore need the ability to freeze them all at the same time. This means that you need to have all of the enable bits for all of the counters, or at least a subset, in the same CSR (a minimal sketch of this follows at the end of this section). If you want, the enable bit can be in both the per-counter control registers and in a central CSR - i.e. there can be multiple views of the same bit in different CSRs. Performance analysts would really like the ability to freeze the performance counters of multiple CPUs at the same time. This is one motivation for that pin control signal.

---+ Precise performance monitoring interrupts - you can only wish!

When you are doing performance counter event interrupt based sampling, it would be really nice if the interrupt occurred exactly at the instruction that had the event. If you can do that, great. However, it takes extra hardware. Also, some events do not in any way correspond to a retired instruction - think of events that occur on speculative instructions that never graduate/retire. Again, you can create special registers that record, say, the program counter of such a speculative instruction, but that is again extra hardware. IMHO there is zero chance that all implementations, particularly the lowest cost implementations, will make all performance events precise. At the very least there should be a way of discovering whether an event is precise or not. Machine check architectures have the same issue.
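A minimal sketch of the counter-freeze point above - one central register holding an enable bit per counter, so a single write freezes a whole set of counters in the same cycle. read_ctrl()/write_ctrl() stand in for whatever CSR access an implementation provides; all names are illustrative:

    /* Central-enable sketch: one register with an enable bit per counter,
     * so a single write freezes a set of counters in the same cycle and
     * ratios such as miss rates stay consistent. */
    #include <stdint.h>

    extern uint64_t read_ctrl(void);
    extern void     write_ctrl(uint64_t v);

    void freeze_counters(uint64_t counter_mask)
    {
        write_ctrl(read_ctrl() & ~counter_mask); /* stop the selected counters */
    }

    void unfreeze_counters(uint64_t counter_mask)
    {
        write_ctrl(read_ctrl() | counter_mask);  /* restart them together */
    }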
---+ What should the priority of the hardware performance monitoring interrupt be?

One of the best things I did for Intel was to punt on this issue: because I was also the architect in charge of the APIC interrupt controller, I provided a special LVT interrupt register just for the performance monitoring interrupt. This allowed the performance monitoring interrupt to use all of the APIC features - all of those that made sense. For example, the performance monitoring interrupt that Linux uses is just a normal interrupt of some priority. But as I mentioned above, the very first usage of performance monitoring interrupts used NMI, and was therefore able to profile code that had interrupts blocked. The interrupt could be directed to SMM, the not very good baby virtual machine monitor, allowing performance monitoring to be done independent of the operating system - very nice when you don't have source code for the operating system. And so on.

I can't remember, but it is possible that the interrupt could be directed to processors other than the local processor. However, that would not subsume the externally visible pin control, because the hardware pin can be a lot less expensive and a lot more precise than signaling an inter-processor interrupt.

I used a similar approach for machine check interrupts, which could also be directed to the operating system, NMI, SMM, hypervisor, ...

By the way: I think Greg Favor said that x86's performance monitoring interrupts are level sensitive. That is not strictly speaking true: whether they are level sensitive or not is programmed into the APIC local vector table. You can make them level sensitive or edge triggered. Obviously, however, when there are multiple performance counters sharing the same interrupt, you need to know which counter overflowed. Hence the sticky overflow bits that Greg noticed in the manuals.

---+ Fancier stuff

The above is mostly about getting the most out of simple performance counters: providing filter logic so that you can get the most insight out of the limited number of events; providing enables in a central place so that you can freeze multiple counters at the same time; allowing the performance counter interrupts to be directed not just at different privilege levels but at different interrupt priorities, including NMI, and possibly also at external hardware.

There's a lot more that can be done to help performance monitoring. Unfortunately, I have always worked at places where I had to reduce the performance monitoring hardware cost as much as possible. I am sure, however, that many of you are familiar with fancier performance monitoring features, such as:

+ Histogram counting (allowing you to count distributions without making multiple runs) => the CMASK comparators allow a very simple form of this, assuming you have enough performance counters. Actual histogram counters can do this more cheaply.

+ Cycle attribution - defining performance events so that you can actually say things like "X% of cycles are spent waiting for memory".

IMHO the single most important "advanced" performance monitoring feature is what I call "longitudinal profiling": AMD IBS (Instruction Based Sampling), DEC ProfileMe, ARM SPE (Statistical Profiling Extension). The basic idea is to set a bit on some randomly selected instruction packet somewhere high up in the pipeline, e.g. at instruction fetch, and then let that bit flow down the pipeline, sampling things as it goes. E.g.
you might sample the cache miss latency or address, or whether it produced a stall in interaction with a different marked instruction. This sort of profiling is quite expensive, e.g. requiring a bit in many places in the pipeline, as well as registers to record the sample data, but it provides a lot of insight: it can give you distributions and averages, and it can tell you which interactions between instructions are causing problems. However, if RISC-V cannot yet afford to do longitudinal profiling, the performance counter filter logic that I described above is low-hanging fruit, much cheaper.
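For concreteness, the kind of per-instruction record such a marked-instruction scheme might capture - the fields are purely illustrative, not any shipping IBS/ProfileMe/SPE format:

    /* Illustrative record for one marked ("tagged") instruction. */
    #include <stdint.h>

    struct sample_record {
        uint64_t pc;             /* PC of the marked instruction          */
        uint64_t effective_addr; /* for loads/stores                      */
        uint32_t miss_latency;   /* cycles spent on a cache miss, if any  */
        uint32_t flags;          /* retired? stalled? collided with
                                    another marked instruction?           */
    };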
On 7/20/2020 5:43 PM, alankao wrote:
It was just so poorly rendered in my mail client, so please
forgive my spam.
Hi Brian,

> I have been working on a similar proposal myself, with overflow, interrupts, masking, and delegation. One of the key differences in my proposal is that it unifies each counter's configuration control into a per-counter register, by using mhpmevent* but with some fields reserved/assigned a meaning. <elaborating>

Thanks for sharing your experience and the elaboration. The overloading-hpmevent idea looks like the one in the SBI PMU extension threads in the Unix Platform Spec TG by Greg. I have a bunch of questions. What happened to your proposal? Was it discussed in public? Did you manage to implement your idea into a working HW/S-mode SW/U-mode SW solution? If so, we can compare our approaches by benchmarking the LoC of the perf patch (assuming you do it on Linux) and the system overhead of running a long perf sample.

> Another potential discussion point is, does overflow happen at 0x7fffffffffffffff -> 0x8000000000000000, or at 0xffffffffffffffff -> 0x0000000000000000? I have a bias towards the former so that even after overflow, the count is wholly contained in an XLEN-wide register treated as an unsigned number and accessible via a single read, which makes arithmetic convenient, but I know some people prefer to (or are used to?) have the overflow bit as a 33rd or 65th bit in a different register.

I have no bias here as long as the HPM interrupt can be triggered. But somehow it seems to me that you assume the HPM registers are XLEN wide, but actually they are not (yet?). The spec says they should be 64 bits wide, although obviously nobody implements nor remembers that.

> Lastly, a feature I have enjoyed using in the past (on another ISA) is the concept of a 'marked' bit in the mstatus register. ... This is of course a bit intrusive in the architecture, as it requires adding a bit to mstatus, but the rest of the kernel just needs to save and restore this bit on context switches, without knowing its purpose.

Which architecture/OS are you referring to here?

Through this discussion, we will understand which idea the community prefers: adding CSRs, overloading existing hpmevents, or some balanced compromise. I believe the ultimate goal of this thread should be determining what the RISC-V HPM should really be like.

Best,
Alan

I apologize for some of the language
errors that occur far too frequently in my email. I use speech
recognition much of the time, and far too often do not catch
misrecognition errors. This can be quite embarrassing, amusing,
and/or confusing. Typical errors are not spelling but homonyms,
words that sound the same - e.g. "cash" instead of "cache".
Andy Glew Si5
BTW, I have absolutely no idea
what changes would be necessary to the RISC-V HPM CSRs.
I do note, however, that the x86 control bits pretty much live
in one CSR per counter. Later versions added a few central
registers for things like having all of the enable bits in one
place, so you can get a global freeze.
alankao
Hi Andy,
Thank you for the hints as an Intel PMU architect. My question is about the mode selection part, as below. It is not difficult to implement a mechanism such that an event is only counted in some privileged modes. Both Greg's and my approach can achieve this. But in practice, we found profiling higher-privileged modes has some problems.

Under a basic Unix-like RISC-V configuration, the kernel runs in S-mode and there is M-mode for platform-specific stuff. Say we now want to sample M-mode software. The first implementation decision is which mode the HPM interrupt should go to. Everything is more controllable if the interrupt can just go to S-mode, but obviously there is no easy way for S-mode software, the kernel, to read general M-mode information like the mepc (Machine Exception Program Counter) register. The other route goes to M-mode, but since the RISC-V HPM interrupt has never been seriously/publicly discussed until this thread, the effort so far, including the current PMU SBI extension proposal, does not address this. I am curious how x86 addresses this problem. How does it enable hypervisor-mode sampling without similar issues?
Andy Glew Si5
x86's hardware performance monitoring interrupt is delivered
to whatever is specified by the local APIC's LVT (Local Vector
Table) entry for performance monitoring. This gets it out of
being a special case, and just makes it like any other interrupt.
The hypervisor or virtual machine manager has to be able to handle
any interrupt appropriately. In a simple virtual machine
architecture, such as the initial version of Intel VT that I
worked on, all external interrupts go to the hypervisor, and
then the hypervisor can decide if it wants to deliver them to a
guest privilege level. Fancier virtual machine architectures such
as current Intel allow certain interrupts to be sent directly to
the guest, without being caught by the hypervisor first.
There should not be any special handling for hardware
performance monitoring interrupts. They should be just like any
other interrupt or exception. There should be a uniform
delegation architecture for all interrupts and traps. Eliminate
as many special cases as possible.
For any given interrupt or exception, sometimes you want it to
go straight to the hypervisor, sometimes straight to the guest.
I say "hypervisor" here, but it might just as well be M-mode:
or generalize, sometimes you want it to go to the most privileged
software level, sometimes to the least, sometimes one of the
privileged software levels in between. The interrupt architecture
should support that.
--
There's a bit of funkiness with respect to precise performance
monitoring exceptions, just like there is for machine check. If you
go through a complicated interrupt vectoring mechanism, it may
become difficult to be precise. In fact, that's one of the
reasons why P6's original performance monitoring interrupt
was imprecise (that, and the fact that it took several cycles
for an event to propagate from the unit where it occurred to the
performance counter logic, and not even a uniform number of cycles -
there was a tree of wires with differing numbers of latches on
different paths).
But that is okay-ish: you can have an interlock to
prevent more instructions from retiring after the instruction
where the precise performance monitor event has occurred,
taking care to avoid deadlock, e.g. taking care that a higher
priority interrupt can preempt while that interlock is in flight.
Or you can add mechanisms to provide appropriate sampling
when interrupts are actually imprecise. Or you can add a new
interrupt/exception delivery mechanism that basically does the
first thing, but throws out some of the complexity of your
legacy trap delivery mechanism. It's microarchitecture.
By the way, if your performance counter takes more than one
cycle to propagate a carry and detect overflow, you need such an
interlock anyway. That is not a common problem, but at Intel
circa 2000 we regularly imagined a "fireball" OOO core that ran
at frequencies such that you could only propagate a carry
across 16 bits at a time; if you are wire limited rather than
logic limited, that is four cycles for a 64-bit counter. In fact,
I believe that Intel PEBS (Precise Event Based Sampling) does not
actually sample when the counter overflows; instead, when the counter
overflows, it sets a bit that says: the next time this event is
recognized, generate the interrupt. Which, if you think about
it, is actually imprecise if more than one event occurs in any
given cycle.
However, propagating through the interrupt delivery logic
probably takes more cycles than propagating a carry across 64 bits.
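A behavioral sketch of that arm-on-overflow scheme (a model of the behavior as described, not Intel's actual implementation):

    /* Behavioral model of "arm on overflow, sample on the next event". */
    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        uint64_t count;
        bool     armed;  /* set when the counter wraps */
    } counter_t;

    /* Called once per recognized event; returns true when a sample
     * (interrupt/record) should be taken. */
    bool counter_event(counter_t *c)
    {
        if (c->armed) {
            c->armed = false;
            return true;          /* fires one event late: the "skid" */
        }
        if (++c->count == 0)      /* wraparound == overflow */
            c->armed = true;
        return false;
    }

Note the one-event skid: the sample lands on the event after the one that overflowed the counter, which is exactly the imprecision described above.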
Andy Glew Si5
I meant "that's one of the reasons
why P6's original performance monitoring interrupt was imprecise"
darn that speech misrecognition: sometimes it inverts the
meaning of what I say. :-(
Greg Favor
A general comment about all the fancy forms of event filtering that can be nice to have: The most basic one of general applicability is mode-specific filtering. Past that, one could try to define some further general filtering capabilities that aren't too specialized, but one quickly gets into having interesting filtering features specific to an event or class of events.

We take the view (in our design) that the lower 16 bits of mhpmevent are used for event selection in the broad sense of the word. For us, the low 8 bits select between one of 256 "types" of events and then the upper 8 bits provide pretty flexible event-specific filtering. That has turned out to very nicely support a very large variety of events in a very manageable way hardware-wise - in contrast to having many hundreds (or more) individual events. But that is just our own implementation of event selection.

My point is that someone else can do similar or not so similar things in their own design with whatever types of general or event-specific filtering features that they might desire. Trying to standardize that stuff can be tricky, to say the least.

For now, at least, I think we should just let people decide what events and event filtering they feel are valuable in their designs. We should only try to standardize any filtering features that are broadly applicable and valuable to have. For myself, that results in just proposing mode-specific event filtering.

Greg
Andy Glew Si5
You seem to be missing the whole
point: the x86 PerfMon/EMON event filtering is generic.
Ditto for Intel: in fact, x86 PERFEVTSEL has precisely an
8-bit field to select which type of event. That may have
increased, although it still seems to be only eight bits in
the current manuals that I just downloaded.
Similarly, Intel PERFEVTSEL has an 8-bit UMASK for
event-specific filtering.
However, there are further fields that define generic event
filtering: filters that you get for free, without having to design
them on a per-event-type basis.
(If you care about implementation: the UMASK field stands for
"Unit Mask" and is propagated to whatever hardware unit is
actually performing the measurement. The other filter bits live
at the performance counter, and therefore apply to all events.)
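For reference, a PERFEVTSEL word is assembled roughly as below; bit positions are quoted from memory, so treat the Intel SDM as authoritative. Note how the generic filters (USR/OS, E, INT, INV, CMASK) sit alongside the event-specific UMASK:

    /* Rough layout of an x86 IA32_PERFEVTSELx value (from memory). */
    #include <stdint.h>

    static inline uint64_t perfevtsel(uint8_t event, uint8_t umask,
                                      uint8_t cmask, int usr, int os,
                                      int edge, int inv, int irq)
    {
        return (uint64_t)event          /* [7:0]   event select         */
             | ((uint64_t)umask << 8)   /* [15:8]  unit mask            */
             | ((uint64_t)!!usr << 16)  /* [16]    count in user mode   */
             | ((uint64_t)!!os << 17)   /* [17]    count in OS mode     */
             | ((uint64_t)!!edge << 18) /* [18]    E: edge detect       */
             | ((uint64_t)!!irq << 20)  /* [20]    INT: irq on overflow */
             | (1ULL << 22)             /* [22]    EN: counter enable   */
             | ((uint64_t)!!inv << 23)  /* [23]    INV: invert CMASK    */
             | ((uint64_t)cmask << 24); /* [31:24] CMASK threshold      */
    }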
The CMASK comparison is applicable to, and relevant to, any
event that can increment by more than one per clock cycle.
The E edge trigger is relevant to, and applicable to, any
event that occurs in bursts of back-to-back events in adjacent
clock cycles.
Ditto with Intel: a small number of events, filtered and
transformed in several different ways.
The main difference is that all of your filtering is event
specific, and therefore you can't write portable code that takes
advantage of it, whereas most of the Intel filtering is generic,
so you can write portable code that takes advantage of it.
There is, of course, some event-specific filtering. I also
observed patterns in that event-specific filtering that I think
could quite usefully be standardized, like the part about
masking different port combinations. However, that is not quite
as generic as a comparison threshold that applies to every event
that can increment by more than one in a clock cycle.
The "push-out" profiling feature could also be generic,
counting the length of intervals in which no event occurs, for
any event. I did not do this in P6 because push-out profiling
requires an extra counter, even if it's only a few bits like six.
Similarly, you could tweak the edge detect filter to smooth
over a few clock cycles. again, that would be generic.
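A behavioral sketch of generic push-out profiling - while the (already filtered) condition holds, a small side counter ticks once per cycle; when the condition ends, its value is the recorded sample. All names are illustrative:

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        uint16_t interval;  /* cycles the condition has held so far */
        bool     prev;      /* condition in the previous cycle      */
    } pushout_t;

    /* Called every cycle with the filtered condition, e.g. "no
     * instruction retired". Returns >0 when an interval just ended,
     * giving its length in cycles. */
    uint16_t pushout_cycle(pushout_t *p, bool cond)
    {
        uint16_t sample = 0;
        if (cond) {
            p->interval++;          /* condition continues: keep ticking */
        } else if (p->prev) {
            sample = p->interval;   /* condition just ended: emit length */
            p->interval = 0;
        }
        p->prev = cond;
        return sample;
    }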
alankao
Hi all,
Although some discussions remain unresolved, they have little to do with what we should do for the next step. I believe Greg's proposal is superior to the original one in the starting thread because:

1. It reuses `hpmevents` for most of the functions that we all agree RISC-V needs, instead of adding a bunch of new registers.
2. It is H-ext-aware.

I suggest Greg take the lead and start a PR in the ISA repo; I can help review and evaluate the effort to patch existing software.

Thanks,
Alan
Greg Favor
Alan,

I'm fine with taking the lead on this architecture extension. But it should follow a proper process as directed by the TSC. Thus far this would mean getting a new TG created or doing something less formally under an existing TG. But for smaller extension proposals like this there is a need for a lighter-weight and faster process. The need for this is recognized, and I suspect such a process will probably be promulgated by the TSC some time soon.

So I suggest we pause for a short bit, and then see if we can follow that expedited process once it is available. In the meantime I/we can prepare what we can in advance. (I don't think this will represent a material slowdown to getting to a frozen spec and then to ratification.)

Greg
Anup Patel
Hi Greg,
Few comments on your proposal (https://lists.riscv.org/g/tech-privileged/message/205):
1. The BIT[31] is not required because we already have the MCOUNTINHIBIT CSR.
2. The BIT[28] contradicts the CSR number semantics of the HPMCOUNTER CSRs because currently all HPMCOUNTER CSRs are "User-Read-Only".
3. We need to align the "event_info" definition in the SBI PMU extension to cover your proposed bits in the MHPMEVENT CSRs.
Regards, Anup
Greg Favor
Anup,

Thanks. Comments below.

Greg

On Mon, Aug 3, 2020 at 9:25 PM Anup Patel <Anup.Patel@...> wrote:
This is up in the air for inclusion or not in the proposal. As solely a bit that software can set/clear to start/stop a counter, the argument for having this bit is weak - although SBI calls that write the mhpmevent CSR for a counter would need some way to recognize when the associated bit in mcountinhibit needs to be set or cleared. But with this bit in mhpmevent itself, no special support is needed (i.e. the writing of event_info into the upper part of mhpmevent takes care of whatever bits are there).

The argument for this bit in mhpmevent grows when one allows for hardware setting and clearing of the bit. For example, in response to a cross-trigger from the debug Trigger Module (e.g. to start counting when a certain instruction is executed and to stop counting when another address is executed). Or to start/stop counting in response to another counter overflowing after N occurrences of some event. Currently cross-trigger capabilities like this aren't standardized but, irrespective of whether they get standardized or not, having a standard Active bit provides the framework for a design to have whatever mechanisms it desires.
Greg Favor
Email accidentally sent early. Let me finish the email and then I'll send it again. Greg
Greg Favor
Anup,

Thanks. Comments below.

Greg

On Mon, Aug 3, 2020 at 9:25 PM Anup Patel <Anup.Patel@...> wrote:
This is up in the air for inclusion or not in the proposal. As solely a bit that software can set/clear to start/stop a counter, the argument for having this bit is weak - although SBI calls that write the mhpmevent CSR for a counter would need some way to recognize when the associated bit in mcountinhibit needs to be set or cleared. But with this Active bit in mhpmevent itself, no special support is needed (i.e. the writing of event_info into the upper part of mhpmevent takes care of whatever bits are there).

The argument for this bit in mhpmevent grows when one allows for hardware setting and clearing of the bit. For example, in response to a cross-trigger from the debug Trigger Module, e.g. to start counting when a certain instruction is executed and to stop counting when another address is executed. Or to start/stop counting in response to another counter overflowing after N occurrences of some event. In essence, for counting more complex types of event conditions, particularly in debug scenarios and less so in straight perf mon scenarios.

Currently cross-trigger capabilities like these aren't standardized but, irrespective of whether they get standardized or not, having a standard Active bit provides the framework for a design to have whatever mechanisms it desires. And note that hardware manipulation of mcountinhibit bits would be a change to the architectural definition of mcountinhibit. This isn't a forcing issue, but having this Active bit in mhpmevent sidesteps that issue.

But even with all this, it is still up in the air whether people want or don't want to standardize this separate counter control bit as part of a counter extension. We'll see where people fall on this.
Good point. Supporting this feature (which some others have also been requesting) will require defining an alias CSR for each hpmcounter CSR that is "User-Read-Write". Having two User aliases of the same CSR is conceptually not pretty, but it is simple and seems like a necessary evil for supporting this feature. Like above, we'll have to see if the interest in this feature is significant enough to warrant adding read/write hpmcounter aliases.
In my mind, event_info simply fills in all the higher bits of mhpmevent that are not written by event_idx - which I believe was to be the default code path in the SBI PMU code. (This, of course, applies for future implementations that choose to organize their mhpmevent registers in this simple manner. Implementations are free to organize their mhpmevent CSRs differently and supply corresponding implementation-specific SBI code.) In other words (for RV64):

mhpmevent[19:16] = event_idx.type
mhpmevent[15: 0] = event_idx.code
mhpmevent[63:20] = event_info[43:0]

Greg
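As a sketch, that packing as a single helper (RV64; the helper name is illustrative):

    #include <stdint.h>

    uint64_t make_mhpmevent(uint32_t event_idx, uint64_t event_info)
    {
        /* event_idx: [19:16] = type, [15:0] = code */
        return ((uint64_t)event_idx & 0xfffff)           /* mhpmevent[19:0]  */
             | ((event_info & 0xfffffffffffULL) << 20);  /* mhpmevent[63:20] */
    }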
Anup Patel
Hi Greg,
We have SBI_PMU_COUNTER_START/STOP calls where the SBI implementation will update the MCOUNTINHIBIT bits. The SBI_PMU_COUNTER_START call also takes a parameter for the initial value of the counter, so we don't need the HPMCOUNTER CSRs to be writeable in S-mode, and this also avoids alias CSRs.
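For reference, a sketch of the start call in question, per the draft SBI PMU extension under discussion; the EID/FID/flag values are from that draft and may have changed since:

    #include <stdint.h>

    #define SBI_EXT_PMU                       0x504D55  /* "PMU" */
    #define SBI_PMU_COUNTER_START             3
    #define SBI_PMU_START_FLAG_SET_INIT_VALUE (1UL << 0)

    struct sbiret { long error; long value; };
    extern struct sbiret sbi_ecall(int ext, int fid, unsigned long a0,
                                   unsigned long a1, unsigned long a2,
                                   unsigned long a3);

    /* Start the counters selected by mask, loading an initial value so
     * S-mode never needs write access to the hpmcounter CSRs. */
    static inline struct sbiret
    sbi_pmu_counter_start(unsigned long base, unsigned long mask,
                          uint64_t init_value)
    {
        return sbi_ecall(SBI_EXT_PMU, SBI_PMU_COUNTER_START, base, mask,
                         SBI_PMU_START_FLAG_SET_INIT_VALUE, init_value);
    }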
I think we can remove BIT[28] in your proposal.
Regards, Anup
From: tech-privileged@... <tech-privileged@...>
On Behalf Of Greg Favor
Sent: 04 August 2020 10:42 To: Anup Patel <Anup.Patel@...> Cc: alankao <alankao@...>; tech-privileged@... Subject: Re: [RISC-V] [tech-privileged] A proposal to enhance RISC-V HPM (Hardware Performance Monitor)
Anup,
Thanks. Comments below.
Greg
On Mon, Aug 3, 2020 at 9:25 PM Anup Patel <Anup.Patel@...> wrote:
Whether this bit is included in the proposal is still up in the air. As solely a bit that software can set/clear to start/stop a counter, the argument for having it is weak, although SBI calls that write the mhpmevent CSR for a counter would need some way to recognize when the associated bit in mcountinhibit needs to be set or cleared. With this Active bit in mhpmevent itself, no special support is needed (i.e. writing event_info into the upper part of mhpmevent takes care of whatever bits are there).
The argument for this bit in mhpmevent grows once one allows for hardware setting and clearing of the bit. For example, in response to a cross-trigger from the debug Trigger Module, e.g. to start counting when a certain address is executed and to stop counting when another address is executed. Or to start/stop counting in response to another counter overflowing after N occurrences of some event. In essence, this supports counting more complex types of event conditions, particularly in debug scenarios and less so in straight perf mon scenarios.
Cross-trigger capabilities like these aren't currently standardized, but irrespective of whether they ever are, a standard Active bit provides the framework for a design to have whatever mechanisms it desires. Note also that hardware manipulation of mcountinhibit bits would be a change to the architectural definition of mcountinhibit. That isn't a forcing issue, but having this Active bit in mhpmevent sidesteps it.
Even with all this, it is still up in the air whether people want to standardize this separate counter control bit as part of a counter extension. We'll see where people fall on this.
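To make the software start/stop case concrete, here is a sketch of what M-mode control would look like with such an Active bit, versus today's mcountinhibit route. The bit position (31) is taken from the Bit[31] mentioned later in this thread, and Active=1 meaning "counting" is an assumption of this sketch, not ratified architecture.

    #define MHPMEVENT_ACTIVE (1UL << 31) /* proposed Active bit; an assumption */

    /* Option A (proposed): per-counter Active bit in mhpmevent3. */
    static inline void counter3_start(void)
    {
        asm volatile("csrs mhpmevent3, %0" :: "r"(MHPMEVENT_ACTIVE));
    }
    static inline void counter3_stop(void)
    {
        asm volatile("csrc mhpmevent3, %0" :: "r"(MHPMEVENT_ACTIVE));
    }

    /* Option B (today): bit 3 of mcountinhibit inhibits hpmcounter3. */
    static inline void counter3_stop_inhibit(void)
    {
        asm volatile("csrs mcountinhibit, %0" :: "r"(1UL << 3));
    }

The point of the hardware case is that a cross-trigger could flip that same Active bit without any CSR write from software.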
Good point. Supporting this feature (which some others have also been requesting) will require defining an alias CSR, one that is "User-Read-Write", for each hpmcounter CSR.
Having two User aliases of the same CSR is conceptually not pretty, but this is simple and seems like a necessary evil for supporting this feature.
Like above, we'll have to see if the interest in this feature is significant enough to warrant adding read/write hpmcounter aliases.
In my mind event_info simply fills in all the higher bits of mhpmevent that are not written by event_idx - which I believe was to be the default code path in the SBI PMU code. (This, of course, applies to future implementations that choose to organize their mhpmevent registers in this simple manner. Implementations are free to organize their mhpmevent CSRs differently and supply corresponding implementation-specific SBI code.) In other words (for RV64):
mhpmevent[19:16] = event_idx.type
mhpmevent[15: 0] = event_idx.code
mhpmevent[63:20] = event_info[43:0]
Greg
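A sketch of that packing for RV64, assuming the simple organization above (the helper name and C types are illustrative, not from the proposal):

    #include <stdint.h>

    /* Compose an mhpmevent value from a 20-bit event_idx
     * (type in bits [19:16], code in [15:0]) and 44 bits of event_info. */
    static inline uint64_t make_mhpmevent(uint32_t event_idx,
                                          uint64_t event_info)
    {
        return ((event_info & ((1ULL << 44) - 1)) << 20)
               | (event_idx & 0xFFFFFu);
    }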
Hi Greg,
No issues with Bit[31] of your proposed MHPMEVENT definition. The SBI_PMU_COUNTER_START/STOP calls can either update the MCOUNTINHIBIT bits or update Bit[31] of the appropriate MHPMEVENT CSR.
Regarding Bit[28], I agree with you. Let’s wait for more comments.
Regards, Anup
From: Greg Favor <gfavor@...>
Sent: 04 August 2020 11:40
To: Anup Patel <Anup.Patel@...>
Cc: alankao <alankao@...>; tech-privileged@...
Subject: Re: [RISC-V] [tech-privileged] A proposal to enhance RISC-V HPM (Hardware Performance Monitor)
On Mon, Aug 3, 2020 at 10:57 PM Anup Patel <Anup.Patel@...> wrote:
That reduces the argument for Bit[31]. I won't remove it yet (until I write up an updated proposal), but I imagine that bit will be dropped if no one else speaks up in support of it. (Although if/when someone (such as us) supports hardware events starting and stopping counters, we'll have to deal with the fact that this is a change to the current arch definition of the mcountinhibit CSR.)
I'm OK with this. It is other people that have been more desirous of this feature. I agree that it would be cleaner and simpler to not have this bit, but let's see who speaks up for keeping this feature.
Greg
Hi all,
I am for bit[28]. I tried to defend the idea, but unfortunately my reply went private to Anup only, and the system doesn't keep any backup. Please wait and see my comment once Anup replies.
The most desired feature of a PMU is counting events in the right context (or right mode). This is not clearly defined in the RISC-V spec right now. Greg's proposal already addresses this in a clean way by defining the required bits in the MHPMEVENT CSRs.
Another important feature expected from a PMU is reading counters without traps; this is already available because the HPMCOUNTER CSRs are "User-Read-Only".
Regarding HPMCOUNTER writes from S-mode: the Linux PMU drivers (across architectures) only write a PMU counter at the time of configuring the counter. We need an SBI call to configure a RISC-V PMU counter anyway, because the MHPMEVENT CSR is M-mode-only, so it is better to initialize/write the MHPMCOUNTER CSR in M-mode at the time of configuring the counter. Allowing S-mode to write the HPMCOUNTER CSRs is nice but won't provide much benefit. On the contrary, RISC-V hypervisors might end up saving/restoring more CSRs if the HPMCOUNTER CSRs are writeable from S-mode.
The code snippet mentioned below requires an "#ifdef", which means we would have to build the Linux RISC-V image differently to do CSR writes this way. This approach is not going to fly for distros, because a distro can't release a single Linux RISC-V image for all RISC-V hardware if such an "#ifdef" is present.
Regards, Anup
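(Alan's referenced snippet does not survive in this archive. Purely to illustrate the build-time pattern being objected to, with made-up config and helper names, it would have roughly this shape:)

    /* Hypothetical helpers standing in for kernel services. */
    extern void csr_write_hpmcounter(unsigned long idx, unsigned long val);
    extern int  sbi_pmu_counter_start(unsigned long idx, unsigned long val);

    void pmu_write_counter(unsigned long idx, unsigned long init_value)
    {
    #ifdef CONFIG_RISCV_SMODE_HPMCOUNTER_WRITE
        /* Direct CSR write: only works on hardware with writable counter
         * aliases, and the choice is frozen at kernel build time. */
        csr_write_hpmcounter(idx, init_value);
    #else
        /* Portable path: trap to M-mode via an SBI call on every write. */
        sbi_pmu_counter_start(idx, init_value);
    #endif
    }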
From: alankao <alankao@...>
Sent: 04 August 2020 12:44
To: Anup Patel <Anup.Patel@...>
Subject: Private: Re: [RISC-V] [tech-privileged] A proposal to enhance RISC-V HPM (Hardware Performance Monitor)
The bit[28], or the *mcounterwen* bit in my original proposal, is essential for better performance compared to M-mode register access through SBI calls or trap-and-emulate. It is definitely helpful for a simple M-S-U RISC-V system. How many cycles are you willing to give up just to update the M-mode register every time the kernel handles an HPM interrupt or a context switch? I admit it may not be useful, or may even be harmful, for an M-H-S-U system, but it is obvious that a simple RISC-V machine running a single OS kernel can benefit greatly from it.
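The cost difference Alan is pointing at, in sketch form. The alias CSR number here is made up, since today hpmcounterN is read-only below M-mode, which is exactly the point of contention:

    /* Reloading a sampling counter from the HPM overflow handler. */

    /* With a writable S/U-mode alias: one CSR write, a few cycles.
     * 0x803 is a made-up CSR address for the hypothetical alias. */
    static inline void reload_direct(unsigned long period)
    {
        unsigned long init = 0UL - period; /* overflows after 'period' events */
        asm volatile("csrw 0x803, %0" :: "r"(init));
    }

    /* Without it: ecall, M-mode trap entry, CSR writes, mret, and back,
     * typically far more cycles, paid on every overflow interrupt. */
    extern int sbi_pmu_counter_start(unsigned long idx, unsigned long init);
    static inline void reload_via_sbi(unsigned long idx, unsigned long period)
    {
        sbi_pmu_counter_start(idx, 0UL - period);
    }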
Hi Alan,
I never said the HPM overflow interrupt is not important. The MHPMOVERFLOW CSR proposed by Greg is perfectly fine.
I think you missed my point regarding the H-extension. If S-mode is allowed to directly write the HPMCOUNTER CSRs, then for the H-extension we will need additional VSHPMCOUNTER CSRs to allow a hypervisor to context-switch them. We can avoid a lot of these CSRs by keeping the HPMCOUNTER CSRs read-only for S-mode. The initialization/restoration of an HPMCOUNTER value can be done via the SBI_PMU_COUNTER_START call, and this integrates well with the Linux PMU driver framework too, which only updates a counter value in its "add()" or "start()" callbacks. That's why allowing S-mode to write the HPMCOUNTER CSRs won't provide much benefit.
Regarding a single Linux RISC-V image for all platforms: this is a requirement from various distros and the Linux RISC-V maintainers. We should avoid any kernel feature that needs to be explicitly enabled by users and that distros would keep disabled by default. The "#ifdef"-based feature check should be replaced by a runtime feature check based on device tree or something else.
Regards, Anup
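A sketch of the runtime check Anup describes, with a made-up device tree property and helper names (no such binding exists yet): one kernel image probes the capability at boot and picks the path at run time.

    #include <stdbool.h>

    /* Hypothetical helpers standing in for kernel services. */
    extern bool fdt_has_property(const void *fdt, const char *prop);
    extern void csr_write_hpmcounter(unsigned long idx, unsigned long val);
    extern int  sbi_pmu_counter_start(unsigned long idx, unsigned long val);

    static bool hpm_s_mode_writable; /* probed once at boot */

    void pmu_probe(const void *fdt)
    {
        hpm_s_mode_writable = fdt_has_property(fdt, "riscv,hpm-s-write");
    }

    void pmu_write_counter(unsigned long idx, unsigned long val)
    {
        if (hpm_s_mode_writable)
            csr_write_hpmcounter(idx, val);  /* fast path */
        else
            sbi_pmu_counter_start(idx, val); /* portable path */
    }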
From: tech-privileged@... <tech-privileged@...> On Behalf Of alankao
Sent: 05 August 2020 07:47
To: tech-privileged@...
Subject: Re: [RISC-V] [tech-privileged] A proposal to enhance RISC-V HPM (Hardware Performance Monitor)
Hi Anup,