
Re: A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

Greg Favor
 

A general comment about all the fancy forms of event filtering that can be nice to have:

The most basic one of general applicability is mode-specific filtering.  Past that, one could try to define some further general filtering capabilities that aren't too specialized, but one quickly gets into interesting filtering features specific to an event or class of events.

We take the view (in our design) that the lower 16 bits of mhpmevent are used for event selection in the broad sense of the word.  For us, the low 8 bits select one of 256 "types" of events, and the upper 8 bits provide pretty flexible event-specific filtering.  That has turned out to support a very large variety of events very nicely, in a very manageable way hardware-wise - in contrast to having many hundreds (or more) of individual events.  But that is just our own implementation of event selection.
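A rough C sketch of that split (the field layout is as described above; the names and example values are hypothetical, not from any spec):

    /* Hypothetical encoding of the low 16 bits of mhpmevent as described
     * above: low 8 bits pick one of 256 event "types", upper 8 bits carry
     * event-specific filter flags.  Names and values are illustrative. */
    #include <stdint.h>

    static inline uint64_t mhpmevent_encode(uint8_t type, uint8_t filter)
    {
        return ((uint64_t)filter << 8) | type;   /* bits [15:8] | [7:0] */
    }

    /* e.g. a hypothetical type 0x12 = "cache access", with a hypothetical
     * filter bit 0x01 = "misses only":
     *     csr_write(mhpmevent3, mhpmevent_encode(0x12, 0x01));          */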

My point is that someone else can do similar or not so similar things in their own design with whatever types of general or event-specific filtering features that they might desire.  Trying to standardize that stuff can be tricky to say the least.

For now, at least, I think we should just let people decide what events and event filtering they feel are valuable in their designs.  We should only try to standardize any filtering features that are broadly applicable and valuable to have.  For myself, that results in just proposing mode-specific event filtering.

Greg


On Tue, Jul 21, 2020 at 3:53 PM Andy Glew Si5 <andy.glew@...> wrote:
I have NOT been working on a RISC-V performance monitoring proposal, but I've been involved with performance monitoring, first as a user and then as an architect, for many years and at several companies.

I would like to draw this group's attention to some features of Intel x86 performance monitoring that turned out to be pretty successful.

First, you're already talking about hardware performance monitoring interrupts for statistical profiling. Good. A few comments on that below.


---+ Performance event filtering and transformation logic before counting

Next, I think one of the best bang-for-buck features of x86 EMON performance monitoring is the performance counter event filtering. RISC-V has only the most primitive version of this.

Every x86 performance counter has per-counter event select logic.

In addition to that logic, there is a mask that specifies which modes to count in - User, OS, hypervisor.  I see that some of the RISC-V proposals also have that. Good.

But more filtering is also provided:  

Each counter has a "Counter Mask" (CMASK) - really, a threshold. When non-zero, it is compared against the count of the selected event in any given cycle. If the count is >= CMASK, the counter is incremented by 1; if less, no increment.

=> This comparison allows a HW event to be used to profile things like "number of cycles in which 1, 2 or more, 3 or more ... events happened" - e.g. how often you are able to achieve superscalar execution.  In vectors, it might count how many vector elements are masked in or out.   If you have events that correspond to buffer occupancy, you can profile to see where the buffer is full or not.

INV - a bit that allows the CMASK comparison to be inverted.

=> so that you can count cycles with events >= threshold, or with events < threshold.

I would really have liked to have >, <, and == threshold.  And also the ability to increment by 1 when the threshold is exceeded, or by the actual count that exceeds the threshold. The former lets you find where you are getting good superscalar behavior or not; the latter lets you determine the average when exceeding the threshold or not. When I defined this I had to save hardware.
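In rough C, the per-cycle update rule described above looks something like this (a behavioral sketch of the CMASK/INV semantics, not any particular implementation):

    /* Per-cycle counter update with CMASK/INV filtering.  'events' is how
     * many times the selected event occurred in this cycle. */
    #include <stdint.h>

    static uint64_t hpm_update(uint64_t counter, unsigned events,
                               unsigned cmask, int inv)
    {
        if (cmask == 0)
            return counter + events;          /* no threshold: count events  */
        int cond = inv ? (events < cmask)     /* INV set: below threshold    */
                       : (events >= cmask);   /* INV clear: at/above         */
        return counter + (cond ? 1 : 0);      /* count qualifying cycles     */
    }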

This masked comparison allows you to get more different types of events, for events that occur more than once per cycle. That's pretty good, but it doesn't help you with scarce events - events that only occur once every four or eight or N cycles.  Later, Intel added what I call "push-out" profiling: when the comparison condition is met, e.g. when no instruction retires, a counter that increments by one every clock cycle starts; when the condition changes, the value of that counter is what is recorded, naturally subject to all of the usual filtering.  That was too much hardware for me to add in 1991, but it proved very useful.
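A behavioral sketch of that push-out idea, under the same illustrative assumptions (all state and names are made up for the example):

    /* "Push-out" profiling: measure the length of each period during which
     * the filter condition holds (e.g. no instruction retired), recording
     * the length when the period ends. */
    static unsigned pushout_len;            /* cycles in the current period */

    static void pushout_cycle(int cond_met, void (*record)(unsigned))
    {
        if (cond_met) {
            pushout_len++;                  /* condition holds: keep counting  */
        } else if (pushout_len != 0) {
            record(pushout_len);            /* period ended: sample its length */
            pushout_len = 0;
        }
    }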

My first instinct is always to minimize hardware cost for performance monitoring hardware.   The nice thing about the filtering logic at the performance counters is that it removes the hardware cost from the individual units, like the cache,
and leaves it in the performance monitoring unit.  (The question of centralized versus decentralized performance counters is always an issue.  Suffice it to say that Intel P6 opted for centralized performance counters, to save hardware;
Pentium 4 went to a fully distributed performance counter architecture, but users hated it, so Intel returned to the centralized model, at least architecturally, although microarchitectural implementations might be decentralized.)

More filtering logic: each counter has an E (edge select) bit.  This counts when the condition described by the privilege level mask and CMASK comparison changes.    Using such edge filtering, you can determine the average length of bursts, e.g. the average length of a period during which you have not been able to execute any instructions, and so on.   Simple filters can give you average lengths and occupancies; fancier and more expensive stuff is necessary to actually determine a distribution.
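For example (an illustrative use of the E bit, assuming the edge detector counts only deasserted-to-asserted transitions): program one counter on a condition with E=0 to count cycles, and a second on the same condition with E=1 to count its rising edges - one per burst - and the average burst length falls out:

    /* Average burst length from two counters on the same condition:
     * cond_cycles (E=0) counts cycles the condition held; bursts (E=1)
     * counts rising edges of the condition, i.e. one per burst. */
    #include <stdint.h>

    static double avg_burst_length(uint64_t cond_cycles, uint64_t bursts)
    {
        return bursts ? (double)cond_cycles / (double)bursts : 0.0;
    }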

Certain events are themselves sent to the performance counters as bitmasks - e.g. the utilization of the execution unit ports, on the original P6 { ALU0/MUL/DIV/FMUL/FADD, ALU1/LEA, LD, STA, STD }, fancier on modern machines.  By controlling the UMASK field of the filter control logic for each performance counter, you could specify counting all instruction dispatches, or just loads, and so on.   Changing the UMASK field allowed you to profile to find out which parts of the machine were being used and which were not.   (This proved successful enough to get me, and the software guy who eventually started using it, an achievement award.)

If I were to do it over again I would have a generic POPCNT as part of the filter logic, in addition to the comparison.
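A sketch of UMASK filtering on a bitmask-valued event, plus the POPCNT variant wished for above (the increment rules are illustrative, not the exact x86 behavior):

    /* UMASK filtering of a bitmask-valued event such as execution-port
     * utilization.  The classic behavior increments when any selected bit
     * is set; the POPCNT variant would add the number of selected bits. */
    #include <stdint.h>

    static unsigned umask_increment(uint32_t event_bits, uint32_t umask,
                                    int use_popcount)
    {
        uint32_t hit = event_bits & umask;            /* selected subevents */
        if (use_popcount)
            return (unsigned)__builtin_popcount(hit); /* count all selected */
        return hit != 0;                              /* classic: 0 or 1    */
    }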

Finally, simple filter stuff:

INT - Performance interrupt enable

PC - pin control - this predated me: it toggled an external pin when the performance counter overflowed.  The very first EMON event sampling took that external pin and wired it back to the NMI pin of the CPU.  Obviously, it is better to have internal logic for performance monitoring interrupts. Nevertheless, there is still a need for externally visible performance event signals, e.g. to freeze performance counters outside the CPU, in the fabric, or in I/O devices.  Exactly what those are is obviously implementation dependent, but it's still good to have a standard way of controlling such implementation-dependent features. I call this the "pin architecture", and IMHO maintaining such system compatibility was as much a factor in Intel's success as instruction set architecture.

---+ Performance counter freeze

There are always several performance counters. At least two per privilege level. At least a pair, so you can compute things like cache miss rates and other ratios.     But not necessarily dedicated to any specific privilege level, because that would be wasteful: you can study things a hell of a lot more quickly if you can use all of the performance counters when other modes are not using them.

When you have several performance counters, you are often measuring things together. You therefore need the ability to freeze them all at the same time. This means that you need to have all of the enable bits for all of the counters, or at least a subset, in the same CSR.   If you want, the enable bit can be in both the per-counter control registers and in a central CSR - i.e. there can be multiple views of the same bit in different CSRs.
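On RISC-V, the mcountinhibit CSR (priv. spec v1.11) is essentially this central view. A sketch of a single-write global freeze (M-mode; bit layout per that spec):

    /* Freezing all counters with one CSR write: mcountinhibit holds one
     * inhibit bit per counter - bit 0 = cycle, bit 2 = instret, bits
     * 3..31 = hpmcounter3..31 (bit 1 is read-only zero).  A single csrs
     * stops them all in the same cycle. */
    static inline void freeze_all_counters(void)
    {
        unsigned long all = ~0x2ul;            /* every valid inhibit bit */
        __asm__ volatile("csrs mcountinhibit, %0" : : "r"(all));
    }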

Performance analysts would really like the ability to freeze multiple CPUs' performance counters at the same time. This is one motivation for that pin control signal.

---+ Precise performance monitoring interrupts - You can only wish!

When you are doing performance counter event interrupt based sampling, it would be really nice if the interrupt occurred exactly at the instruction that had the event.

If you can do that, great. However, it takes extra hardware to do that. Also, some events do not in any way correspond to a retired instruction - think events that occur on speculative instructions that never graduate/retire. Again, you can create special registers that record, say, the program counter of such a speculative instruction, but that is again extra hardware.

IMHO there is zero chance that all implementations, particularly the lowest cost implementations, will make all performance events precise.

At the very least there should be a way of discovering whether an event is precise or not.   

Machine check architectures have the same issue.


---+ What should the priority of the hardware performance monitoring interrupts be?

One of the best things I did for Intel was punt on this issue: because I was also the architect in charge of the APIC interrupt controller, I provided a special LVT interrupt register just for the performance monitoring interrupt.

This allowed the performance monitoring interrupt to use all of the APIC features - at least, all of those that made sense. For example, the performance monitoring interrupt that Linux uses is just a normal interrupt of some priority.  But as I mentioned above, the very first usage of performance monitoring interrupts used NMI, and was therefore able to profile code that had interrupts blocked.   The interrupt could be directed to SMM, the not very good baby virtual machine monitor, allowing performance monitoring to be done independent of the operating system - very nice when you don't have source code for the operating system. And so on. I can't remember, but it is possible that the interrupt could be directed to processors other than the local processor.  However, that would not subsume the externally visible pin control, because the hardware pin can be a lot less expensive and a lot more precise than signaling an inter-processor interrupt.

I used a similar approach for machine check interrupts, which could also be directed to the operating system, NMI, SMM, hypervisor,…

By the way: I think Greg Favor said that x86's performance monitoring interrupts are level-sensitive. That is not strictly speaking true: whether they are level-sensitive or not is programmed into the APIC local vector table. You can make them level-sensitive or edge-triggered.

Obviously, however, when there are multiple performance counters sharing the same interrupt, you need to know which counter overflowed. Hence the sticky bits that Greg noticed in the manuals.
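In the interrupt handler, those sticky bits are what software walks to attribute the interrupt. A sketch against a hypothetical packed overflow-status register (RISC-V defines no such register today):

    /* Attribute a shared PMU interrupt using per-counter sticky overflow
     * bits, assuming a hypothetical status register with one bit per
     * counter.  handle_counter_overflow() is a user-supplied callback. */
    #include <stdint.h>

    extern void handle_counter_overflow(int counter);

    void pmu_irq(uint32_t overflow_status)   /* hypothetical register value */
    {
        while (overflow_status) {
            int c = __builtin_ctz(overflow_status);  /* lowest set bit     */
            handle_counter_overflow(c);
            overflow_status &= overflow_status - 1;  /* clear it, continue */
        }
    }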


---+ Fancier stuff

The above is mostly about getting the most out of simple performance counters:  providing filter logic so that you can get the most insight out of a limited number of events;
providing enables in a central place so that you can freeze multiple counters at the same time;  allowing the performance counter interrupts to be directed not just at different privilege levels but at different interrupt priorities, including NMI, and possibly also to external hardware.

There's a lot more stuff that can be done to help performance monitoring. Unfortunately, I have always worked at places where I had to reduce the performance monitoring hardware cost as much as possible.   I am sure, however, that many of you are familiar with fancier performance monitoring features, such as

+ Histogram counting (allowing you to count distributions without making multiple runs)
    => the CMASK comparators allow a very simple form of this, assuming you have enough performance counters - see the sketch after this list. Actual histogram counters can do this more cheaply.

+ Cycle attribution - defining performance events so that you can actually say things like "X% of cycles are spent waiting for memory".
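The sketch referenced in the histogram item above: with N counters all watching the same event, counter i programmed with CMASK = i+1 counts cycles with at least i+1 events, and adjacent differences recover the distribution (illustrative arithmetic only):

    /* Recover a per-cycle event-count histogram from N counters on the
     * same event, counter i programmed with CMASK = i+1 ("at least i+1
     * events this cycle").  bins[i] = cycles with exactly i+1 events;
     * the last bin means "at least n events". */
    #include <stdint.h>

    void cmask_histogram(const uint64_t at_least[], int n, uint64_t bins[])
    {
        for (int i = 0; i < n; i++)
            bins[i] = at_least[i] - (i + 1 < n ? at_least[i + 1] : 0);
    }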

IMHO the single most important "advanced" performance monitoring feature is what I call "longitudinal profiling": AMD Instruction Based Sampling (IBS), DEC ProfileMe, ARM SPE (Statistical Profiling Extension).  The basic idea is to set a bit on some randomly selected instruction packet somewhere high up in the pipeline, e.g. at instruction fetch, and then let that bit flow down the pipeline, sampling things as it goes. E.g. you might sample the cache miss latency or address, or whether it produced a stall in interaction with a different marked instruction. This sort of profiling is quite expensive, e.g. requiring a bit in many places in the pipeline, as well as registers to record the sample data, but it provides a lot of insight: it can give you distributions and averages, and it can tell you which interactions between instructions are causing problems.
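A sketch of the kind of record such a mechanism might capture per marked instruction (fields loosely modeled on the IBS/ProfileMe/SPE mechanisms named above; every name here is hypothetical):

    /* Hypothetical longitudinal-profiling sample record, captured as a
     * randomly marked instruction flows down the pipeline. */
    #include <stdint.h>

    struct lp_sample {
        uint64_t pc;            /* where the marked instruction was fetched */
        uint64_t data_vaddr;    /* load/store address, if any               */
        uint32_t miss_latency;  /* e.g. cache-miss latency in cycles        */
        uint16_t stall_cycles;  /* stall attributed to this instruction     */
        uint8_t  retired;       /* 0 if squashed before retirement          */
        uint8_t  flags;         /* implementation-defined event bits        */
    };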

However, if RISC-V cannot yet afford to do longitudinal profiling, the performance counter filter logic that I described above is low hanging fruit, much cheaper.




From: Alankao <alankao@...>
Sent: Monday, July 20, 2020 5:43PM
To: Tech-Privileged <tech-privileged@...>
Subject: Re: [RISC-V] [tech-privileged] A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

 

On 7/20/2020 5:43 PM, alankao wrote:
It was just so poorly rendered in my mail client, so please forgive my spam.

Hi Brian,

> I have been working on a similar proposal myself, with overflow, interrupts, masking, and delegation. One of the key differences in my proposal is that it unifies
> each counter's configuration control into a per-counter register, by using mhpmevent* but with some fields reserved/assigned a meaning.  <elaborating>

Thanks for sharing your experience and the elaboration. The overloading-hpmevent idea looks like the one in the SBI PMU extension threads in the Unix Platform Spec TG by Greg. I have a bunch of questions.  What happened to your proposal later? Was it discussed in public? Did you manage to implement your idea into a working HW/S-mode SW/U-mode SW solution? If so, we can compete with each other by benchmarking the LoC of the perf patch (assuming you do it on Linux) and the system overhead of running a long perf sample.

> Another potential discussion point is, does overflow happen at 0x7fffffffffffffff -> 0x8000000000000000, or at 0xffffffffffffffff -> 0x0000000000000000? I have a
> bias towards the former so that even after overflow, the count is wholly contained in an XLEN-wide register treated as an unsigned number and accessible via
> a single read, which makes arithmetic convenient, but I know some people prefer to (or are used to?) have the overflow bit as a 33rd or 65th bit in a different
> register.

I have no bias here as long as the HPM interrupt can be triggered. But it seems to me that you assume the HPM registers are XLEN bits wide, when actually they are not (yet?).  The spec says they should be 64 bits wide, although obviously nobody implements nor remembers that.
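For what it's worth, the arithmetic convenience Brian describes is just unsigned modular subtraction on a full-width read (a sketch assuming 64-bit counter reads):

    /* Delta between two full-width samples of a free-running 64-bit
     * counter.  Unsigned wraparound keeps this correct across at most one
     * wrap regardless of whether the overflow interrupt fires at 2^63 or
     * 2^64; firing at the 2^63 sign-bit transition just guarantees the
     * whole count is still readable as one unsigned value. */
    #include <stdint.h>

    static inline uint64_t counter_delta(uint64_t prev, uint64_t now)
    {
        return now - prev;        /* modulo-2^64 arithmetic */
    }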

> Lastly, a feature I have enjoyed using in the past (on another ISA) is the concept of a 'marked' bit in the mstatus register. ... This is of course a bit intrusive in
> the architecture, as it requires adding a bit to mstatus, but the rest of the kernel just needs to save and restore this bit on context switches, without knowing its
> purpose.

Which architecture/OS are you referring to here? 

Through this discussion, we will understand which idea the community prefers: adding CSRs, overloading existing hpmevents, or some balanced compromise.  I believe the ultimate goal of this thread should be determining what RISC-V HPM should really look like.

Best,
Alan

I apologize for some of the language errors that occur far too frequently in my email. I use speech recognition much of the time, and far too often do not catch misrecognition errors. This can be quite embarrassing, amusing, and/or confusing. Typical errors are not spelling but homonyms, words that sound the same - e.g. "cash" instead of "cache".


Re: A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

Andy Glew Si5
 

I meant "that's one of the reasons why P6's original performance monitoring interrupt was imprecise" - the message below says "forcing precise".  Darn that speech misrecognition: sometimes it inverts the meaning of what I say.  :-(



From: Andy Glew Si5 Via Lists.riscv.org <andy.glew=sifive.com@...>
Sent: Tuesday, July 21, 2020 5:30PM
To: Alankao <alankao@...>, Tech-Privileged <tech-privileged@...>
Subject: Re: [RISC-V] [tech-privileged] A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

 

On 7/21/2020 5:30 PM, Andy Glew Si5 via lists.riscv.org wrote:
> I am curious how x86 addresses this problem. How does it enable hypervisor mode sampling without similar issues?
x86's hardware performance monitoring interrupt is delivered to whatever is specified by the local APIC's LVT (Local Vector Table) entry for performance monitoring.    This gets it out of being a special case, and just makes it like any other interrupt. The hypervisor or virtual machine monitor has to be able to handle any interrupt appropriately.  In a simple virtual machine architecture, such as the initial version of Intel VT that I worked on, all external interrupts go to the hypervisor, and then the hypervisor can decide if it wants to deliver them to a guest privilege level. Fancier virtual machine architectures, such as current Intel, allow certain interrupts to be sent directly to the guest, without being caught by the hypervisor first.

There should not be any special handling for hardware performance monitoring interrupts. They should be just like any other interrupt or exception. There should be a uniform delegation architecture for all interrupts and traps.   Eliminate as many special cases as possible.

For any given interrupt or exception, sometimes you want it to go straight to the hypervisor, sometimes straight to the guest.

I say "hypervisor" here, but it might just as well be M-mode. Or, to generalize: sometimes you want it to go to the most privileged software level, sometimes to the least, sometimes to one of the privileged software levels in between. The interrupt architecture should support that.

--

There's a bit of funkiness with respect to precise performance monitoring exceptions, just like there is for machine checks. If you go through a complicated interrupt vectoring mechanism, it may become difficult to be precise. In fact, that's one of the reasons why P6's original performance monitoring interrupt forcing precise (that, and the fact that it took several cycles to propagate from the unit where the event occurred to the performance counter logic, and not even a uniform number of cycles - there was a tree of wires with differing numbers of latches on different paths).

But that is okay-ish.   You can have an interlock to prevent more instructions from retiring after the instruction where the precise performance monitor event occurred, taking care to avoid deadlock, e.g. taking care that a higher priority interrupt can preempt while that interlock is in flight. Or you can add mechanisms to provide appropriate sampling when interrupts are actually imprecise. Or you can add a new interrupt/exception delivery mechanism that basically does the first thing, but throws out some of the complexity of your legacy trap delivery mechanism. It's microarchitecture.

By the way, if your performance counter takes more than one cycle to propagate a carry and detect overflow, you need such an interlock anyway.  That is not a common problem, but at Intel circa 2000 we regularly imagined a "fireball" OOO core that ran at such high frequencies that you could only propagate a carry across 16 bits at a time; if you are wire limited rather than logic limited, that is four cycles for a 64-bit counter.   In fact, I believe that Intel PEBS (Precise Event Based Sampling) does not actually sample when the counter overflows; instead, when the counter overflows, it sets a bit that says to generate the interrupt the next time the event is recognized. Which, if you think about it, is actually imprecise if more than one event occurs in any given cycle.

However, propagating through the interrupt delivery logic probably takes more cycles than propagating a carry across 64 bits.




From: Alankao <alankao@...>
Sent: Tuesday, July 21, 2020 4:40PM
To: Tech-Privileged <tech-privileged@...>
Subject: Re: [RISC-V] [tech-privileged] A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

 

On 7/21/2020 4:40 PM, alankao wrote:
Hi Andy,

Thank you for the hints from an Intel PMU architect.  My question is about the mode selection part, as below.

It is not difficult to implement a mechanism whereby an event is only counted in certain privileged modes.  Both Greg's approach and mine can achieve this. But in practice, we found that profiling higher-privileged modes has some problems. Under a basic Unix-like RISC-V configuration, the kernel runs in S-mode, and M-mode is there for platform-specific stuff.

Say we now want to sample M-mode software. The first implementation decision is which mode the HPM interrupt should go to. Everything is more controllable if the interrupt can just go to S-mode, but obviously there is no easy way for S-mode software, the kernel, to read general M-mode information like the mepc (Machine Exception Program Counter) register.  The other route is to go to M-mode, but since the RISC-V HPM interrupt has never been seriously/publicly discussed until this thread, the efforts so far, including the current PMU SBI extension proposal, do not address this.

I am curious how x86 addresses this problem. How does it enable hypervisor mode sampling without similar issues?

I apologize for some of the language errors that occur far too frequently in my email. I use speech recognition much of the time, and far too often do not catch misrecognition errors. This can be quite embarrassing, amusing, and/or confusing. Typical errors are not spelling but homonyms, words that sound the same - e.g. "cash" instead of "cache".





Re: A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

Andy Glew Si5
 

BTW, I have absolutely no idea  what changes would be necessary to the RISC-V HPM CSRs.

I do note, however, that the x86 control bits pretty much live in one control register per counter.  Later versions added a few central registers for things like having all of the enable bits in one place so you can get a global freeze.


From: Andy Glew Si5 Via Lists.riscv.org <andy.glew=sifive.com@...>
Sent: Tuesday, July 21, 2020 3:53PM
To: Alankao <alankao@...>, Tech-Privileged <tech-privileged@...>
Subject: Re: [RISC-V] [tech-privileged] A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

 

On 7/21/2020 3:53 PM, Andy Glew Si5 via lists.riscv.org wrote:
I have NOT been working on a RISC-V performance monitoring proposal, but I've been involved with performance monitoring first as a user then is an architect for many years and at several companies.

I would like to draw this group's attention to some features of Intel x86 performance monitoring  that turned out pretty successful.

First, you're already talking about hardware performance monitoring interrupts for statistical profiling. Good. A few comments on that below.

But, I think one of the best bang for buck performance monitoring features of  Sadie six MIN/performance monitoringis the performance counter event filtering

---+ Performance event filtering and transformation logic before counting

Next, I think one of the best bang for buck performance monitoring features of  x86 EMON performance monitoring is the performance counter event filtering. RISC-V has only the most primitive version of this.

Every x86 performance counter has per counter event select logic.

In addition to that logic, there is a mask that specifies what modes to cvount in - User, OS, hypervisor.  I see that some of the Ri5 proposals also have that. Good.

But more filtering is also provided:  

Each counter  has a "Counter Mask" CMASK - really, a threshold. When non-zero, this is compared to the count of the selected event in any given cycle. If >= CMASK, the counter is incremented by 1; if less, no increment.

=> This comparison allows a HW event to be used to profile things like "Number of cycles in which 1, 2 or more, 3 or more ... events happened - e.g. how often you are able to to acheive superscalar execution.  In vectors, it might count how many vector elements are masked in or out.   If you have events that correspond to buffer occupancy, you can profile to see where the buffer is full or not.

INV - a bit that allows the CMASK comparison to be inverted.

=> so that you can count event > threshold, and event < threshold.

I would really have liked to have >, <, and == threshold.  And also the ability to increment by 1 if exceeding threshold, or by the actual count that exceeds the threshold. The former allows you to find where you are getting good superscalar behavior or not, the latter allows you to determine the average when exceeding the threshold or not. When I do find this I had to save hardware.

This masked comparison allows you to get more different types of events, for events that occur more than one per cycle. That's pretty good, abut it doesn't help you with scarce events, t events that only occur once every four or eight or an cycles.  Later, Intel added what I call  "push-out"  profiling: when the comparison condition is met,  e.g. when no instruction retires, a counter that increments one every clock cycle starts ; when the condition changes, the value of that counter is what is recorded, and naturally subject to all of the natural filtering.  That was too much hardware for me to add in 1991, but it proved very useful.

My first instinct is always to minimize hardware cost for performance monitoring hardware.   The nice thing about the filtering logic at the performance counters is that it removed the hardware cost from the individual unit like the cache,
and left it in the performance monitoring unit.  (The question of centralized versus decentralized performance counters is always an issue.  Suffice it to say that Intel P6 had for centralized performance counters, to save hardware;
Pentium 4 went to a fully distributed performance counter architecture, but users hated it, so until return to the centralized model at least architecturally, although microarchitecture implementations might be decentralized )

More filtering logic: each count has an E, edge select bit.  This counts when the condition described by the privilege level mask and CMASK comparison changes.    Using such edge filtering, you can determine the average length of bursts, e.g. the average length of a period that you have not been able to execute any instruction, and so on.   Simple filters can give you average lengths and occupancies; fancier and more expensive stuff is necessary to actually determine a distribution.   

Certain events themselves are sent to the performance counters as bitmasks.  E.g. the utilization of the execution unit ports as a bitmask - on the original P6 { ALU0/MUL/DEV/FMUL/FADD, ALU1/LEA, LD, STA, STD }, fancier on modern machines.  By controlling the UMASK field of the filter control logic for each performance counter, you could specify to count all instruction dispatches, or just loads, and so on.   Changing the UMASK field allowed you to profile to find out which parts of the machine were being used and which not.   (This proved successful enough to get me and the software guy who eventually started using it in achievement award.)

If I were to do it over again I would have a generic POPCNT as part of the filter logic, as well as the comparison.

Finally, simple filter stuff:

INT - Performance interrupt enable

PC - pin control - this predated me: it toggled an external pin when the performance counter overflowed.  The very first EMON event sampling took that external pin and wired it back to the NMI pin of the CPU.  Obviously, it is better to have internal logic for performance monitoring interrupts. Nevertheless, there is still a need for externally visible performance event sampling, e.g. freeze performance events outside the CPU, in the fabric, or in I/O devices.  Exactly what those are is obviously implementation dependent, but it's still good to have a standard way of controlling such implementation dependent features. I call this the "pin architecture", and IMHO maintaining such system compatibility was as much a factor in Intel's success as instruction set architecture.

---+ Performance counter freeze

There are always several performance counters. At least two per privilege level. At least a pair, so you can compute things like cashless rates and other ratios.     But not necessarily dedicated to any specific privilege level, because that would be wasteful: you can study things a hell of a lot more quickly if you can use all of the performance counters, when other modes are not using them.

When you have several performance counters, you are often measuring things together. You therefore need the ability to freeze them all at the same time. This means that you need to have all of the enable bits for all of the counters, or at least a subset, in the same CSR.   If you want, the enable bit can be in both the per counter control registers and in a central CSR - i.e. there can be multiple views of the same bit indifferent CSRs.

Performance analysts would really like the ability to freeze multiple CPUs performance counters at the same time. This is one motivation for that pin control signal

---+ Precise performance monitoring inputs - You can only wish!

When you are doing performance counter event interrupt based sampling, it would be really nice if the interrupt occurred exactly at the instruction that had the event.

If you can do that, great. However, it takes extra hardware to do that. Also, some events do not in any way correspond to a retired instruction - think events that occur on speculative instructions that never graduate/retire. Again, you can create special registers that record, say, the program counter of such a speculative instruction, but that is again extra hardware.

IMHO there is zero chance that all implementations, particularly the lowest cost of limitations, will make all performance events precise.

At the very least there should be a way of discovering whether an event is precise or not.   

Machine check architectures have the same issue.


---+ What should the priority of the hardware performance monitoring interrupts be?

One of the best things I did for Intel was punt on this issue: because I was also the architect in charge of the APIC interrupt controller, I provided a special LVT interrupt register just for the performance monitoring interrupt.

This allowed the performance monitoring interrupt to use all of the APIC features, all of those that made sense. For example, the performance monitoring interrupt that Linux uses is just the normal interrupt of some priority.  But as I mentioned above, the very first usage of performance monitoring interrupts used NMI, and was therefore able to profile code that had interrupts blocked.   The interrupt would be directed to SMM, the not very good baby virtual machine monitor, allowing performance monitoring to be done independent of the operating system. Very nice, when you don't have source code for the operating system. And so on. I can't remember, but it is possible that the interrupt could be directed to other processors other than the local processor.  However, that would not subsume the externally visible pin control, because the hardware pin can be a lot less expensive a lot more precise than signaling and enter processor interrupt.

I used a similar approach for machine check interrupts, which could also be directed to the operating system, NMI, SMM, hypervisor,…

By the way: I think Greg Favor said that x86 is performance monitoring interrupts are  level sensitive. That is not strictly speaking true: whether they are level sensitive or not is programmed into the AIPAC local vector table. You can make it all sensitive or edge triggered.

Obviously, however, when there are multiple performance counters bearing the same interrupt, you need to know which counter overflowed. Hence the sticky bits that Greg noticed in the manuals.


---+ Fancier stuff

The above is mostly about getting the most out of simple performance counters:  providing filter logic so that you can get the most insight out of the limited number of events;
providing enables in a central place so that you can freeze multiple counters at the same time;  allowing the performance counter interrupts to be directed not just a different privilege levels but in different interrupt priorities including NMI,  and possibly also external hardware.

There's a lot more stuff that can be done to help performance monitoring. Unfortunately, I have always worked at a place where I had to reduce  the performance monitoring hardware cost as much as possible.   I am sure, however, that many of you are familiar with fancier performance monitoring features, such as

+ Histogram counting (allowing you to count distributions without making multiple runs)
    => the CMASK comparators allowing very simple form of this, assuming you have enough performance counters. Actual histogram counters can do this more cheaply.

+ Cycle attribution - defining performance events so that you can actually say things like  "X% of cycles are spent waitying for memory".   

IMHO the single most important "advanced"  performance monitoring feature is what I call "longitudinal profiling".   AMD Instruction Based Sampling (IBS), DEC ProfileMe, ARM SPE (Statistical Profiling Extension).  The basic idea is to set a bit on some randomly selected instruction package somewhere high up in the pipeline, e.g. yet instruction fetch, and then let that bit flow down the pipeline, sampling things as it goes. E.g. you might sample the past missed latency or address,  or whether it produced a stall in interaction with a different marked instruction. This sort of profiling is quite expensive, e.g. requiring a bit in many places in the pipeline, as well as registers to record the sample data, but it provides a lot of insight: it can give you distributions and averages, it can tell you what interactions between instructions are causing problems.

However, if RISC-V cannot yet afford to do longitudinal profiling, the performance counter filter logic that I described above is low hanging fruit, much cheaper.




From: Alankao <alankao@...>
Sent: Monday, July 20, 2020 5:43PM
To: Tech-Privileged <tech-privileged@...>
Subject: Re: [RISC-V] [tech-privileged] A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

 

On 7/20/2020 5:43 PM, alankao wrote:
It was just so poorly rendered in my mail client, so please forgive my spam.

Hi Brian,

> I have been working on a similar proposal myself, with overflow, interrupts, masking, and delegation. One of the key differences in my proposal is that it unifies
> each counter's configuration control into a per-counter register, by using mhpmevent* but with some fields reserved/assigned a meaning.  <elaborating>

Thanks for sharing your experience and the elaboration. The overloading-hpmevent idea looks like the one in the SBI PMU extension threads in Unix Platform Spec TG by Greg. I have a bunch of questions.  How was your proposal later? Was it discussed in public? Did you manage to implement your idea into a working HW/S-mode SW/U-mode SW solution? If so, we can compete with each other by real benchmarking the LoC of the perf patch (assuming you do it on Linux) and the system overhead running a long perf sample.

> Another potential discussion point is, does overflow happen at 0x7fffffffffffffff -> 0x8000000000000000, or at 0xffffffffffffffff -> 0x0000000000000000? I have a
> bias towards the former so that even after overflow, the count is wholly contained in an XLEN-wide register treated as an unsigned number and accessible via
> a single read, which makes arithmetic convenient, but I know some people prefer to (or are used to?) have the overflow bit as a 33rd or 65th bit in a different
> register.

I have no bias here as long as the HPM interrupt can be triggered. But somehow it seems to me that you assume the HPM registers are XLEN-width but actually they are not (yet?).  The spec says they should be 64-bit width although obviously nobody implements nor remember that.

> Lastly, a feature I have enjoyed using in the past (on another ISA) is the concept of a 'marked' bit in the mstatus register. ... This is of course a bit intrusive in
> the architecture, as it requires adding a bit to mstatus, but the rest of the kernel just needs to save and restore this bit on context switches, without knowing its
> purpose.

Which architecture/OS are you referring to here? 

Through this discussion, we will understand which idea is the community prefer to: adding CSRs, overloading existing hpmevents, or any balanced compromise.  I believe the ultimate goal of this thread should be determining what the RISC-V HPM should really be like.

Best,
Alan

I apologize for some of the language errors that occur far too frequently in my email. I use speech recognition much of the time, and far too often do not catch misrecognition errors. This can be quite embarrassing, amusing, and/or confusing. Typical errors are not spelling but homonyms, words that sound the same - e.g. "cash" instead of "cache".

I apologize for some of the language errors that occur far too frequently in my email. I use speech recognition much of the time, and far too often do not catch misrecognition errors. This can be quite embarrassing, amusing, and/or confusing. Typical errors are not spelling but homonyms, words that sound the same - e.g. "cash" instead of "cache".


Re: A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

Andy Glew Si5
 

I have NOT been working on a RISC-V performance monitoring proposal, but I've been involved with performance monitoring first as a user then is an architect for many years and at several companies.

I would like to draw this group's attention to some features of Intel x86 performance monitoring  that turned out pretty successful.

First, you're already talking about hardware performance monitoring interrupts for statistical profiling. Good. A few comments on that below.

But, I think one of the best bang for buck performance monitoring features of  Sadie six MIN/performance monitoringis the performance counter event filtering

---+ Performance event filtering and transformation logic before counting

Next, I think one of the best bang for buck performance monitoring features of  x86 EMON performance monitoring is the performance counter event filtering. RISC-V has only the most primitive version of this.

Every x86 performance counter has per counter event select logic.

In addition to that logic, there is a mask that specifies what modes to cvount in - User, OS, hypervisor.  I see that some of the Ri5 proposals also have that. Good.

But more filtering is also provided:  

Each counter  has a "Counter Mask" CMASK - really, a threshold. When non-zero, this is compared to the count of the selected event in any given cycle. If >= CMASK, the counter is incremented by 1; if less, no increment.

=> This comparison allows a HW event to be used to profile things like "Number of cycles in which 1, 2 or more, 3 or more ... events happened - e.g. how often you are able to to acheive superscalar execution.  In vectors, it might count how many vector elements are masked in or out.   If you have events that correspond to buffer occupancy, you can profile to see where the buffer is full or not.

INV - a bit that allows the CMASK comparison to be inverted.

=> so that you can count event > threshold, and event < threshold.

I would really have liked to have >, <, and == threshold.  And also the ability to increment by 1 if exceeding threshold, or by the actual count that exceeds the threshold. The former allows you to find where you are getting good superscalar behavior or not, the latter allows you to determine the average when exceeding the threshold or not. When I do find this I had to save hardware.

This masked comparison allows you to get more different types of events, for events that occur more than one per cycle. That's pretty good, abut it doesn't help you with scarce events, t events that only occur once every four or eight or an cycles.  Later, Intel added what I call  "push-out"  profiling: when the comparison condition is met,  e.g. when no instruction retires, a counter that increments one every clock cycle starts ; when the condition changes, the value of that counter is what is recorded, and naturally subject to all of the natural filtering.  That was too much hardware for me to add in 1991, but it proved very useful.

My first instinct is always to minimize hardware cost for performance monitoring hardware.   The nice thing about the filtering logic at the performance counters is that it removed the hardware cost from the individual unit like the cache,
and left it in the performance monitoring unit.  (The question of centralized versus decentralized performance counters is always an issue.  Suffice it to say that Intel P6 had for centralized performance counters, to save hardware;
Pentium 4 went to a fully distributed performance counter architecture, but users hated it, so until return to the centralized model at least architecturally, although microarchitecture implementations might be decentralized )

More filtering logic: each count has an E, edge select bit.  This counts when the condition described by the privilege level mask and CMASK comparison changes.    Using such edge filtering, you can determine the average length of bursts, e.g. the average length of a period that you have not been able to execute any instruction, and so on.   Simple filters can give you average lengths and occupancies; fancier and more expensive stuff is necessary to actually determine a distribution.   

Certain events themselves are sent to the performance counters as bitmasks.  E.g. the utilization of the execution unit ports as a bitmask - on the original P6 { ALU0/MUL/DEV/FMUL/FADD, ALU1/LEA, LD, STA, STD }, fancier on modern machines.  By controlling the UMASK field of the filter control logic for each performance counter, you could specify to count all instruction dispatches, or just loads, and so on.   Changing the UMASK field allowed you to profile to find out which parts of the machine were being used and which not.   (This proved successful enough to get me and the software guy who eventually started using it in achievement award.)

If I were to do it over again I would have a generic POPCNT as part of the filter logic, as well as the comparison.

Finally, simple filter stuff:

INT - Performance interrupt enable

PC - pin control - this predated me: it toggled an external pin when the performance counter overflowed.  The very first EMON event sampling took that external pin and wired it back to the NMI pin of the CPU.  Obviously, it is better to have internal logic for performance monitoring interrupts. Nevertheless, there is still a need for externally visible performance event sampling, e.g. freeze performance events outside the CPU, in the fabric, or in I/O devices.  Exactly what those are is obviously implementation dependent, but it's still good to have a standard way of controlling such implementation dependent features. I call this the "pin architecture", and IMHO maintaining such system compatibility was as much a factor in Intel's success as instruction set architecture.

---+ Performance counter freeze

There are always several performance counters. At least two per privilege level. At least a pair, so you can compute things like cashless rates and other ratios.     But not necessarily dedicated to any specific privilege level, because that would be wasteful: you can study things a hell of a lot more quickly if you can use all of the performance counters, when other modes are not using them.

When you have several performance counters, you are often measuring things together. You therefore need the ability to freeze them all at the same time. This means that you need to have all of the enable bits for all of the counters, or at least a subset, in the same CSR.   If you want, the enable bit can be in both the per counter control registers and in a central CSR - i.e. there can be multiple views of the same bit indifferent CSRs.

Performance analysts would really like the ability to freeze multiple CPUs performance counters at the same time. This is one motivation for that pin control signal

---+ Precise performance monitoring inputs - You can only wish!

When you are doing performance counter event interrupt based sampling, it would be really nice if the interrupt occurred exactly at the instruction that had the event.

If you can do that, great. However, it takes extra hardware to do that. Also, some events do not in any way correspond to a retired instruction - think events that occur on speculative instructions that never graduate/retire. Again, you can create special registers that record, say, the program counter of such a speculative instruction, but that is again extra hardware.

IMHO there is zero chance that all implementations, particularly the lowest cost of limitations, will make all performance events precise.

At the very least there should be a way of discovering whether an event is precise or not.   

Machine check architectures have the same issue.


---+ What should the priority of the hardware performance monitoring interrupts be?

One of the best things I did for Intel was punt on this issue: because I was also the architect in charge of the APIC interrupt controller, I provided a special LVT interrupt register just for the performance monitoring interrupt.

This allowed the performance monitoring interrupt to use all of the APIC features, all of those that made sense. For example, the performance monitoring interrupt that Linux uses is just the normal interrupt of some priority.  But as I mentioned above, the very first usage of performance monitoring interrupts used NMI, and was therefore able to profile code that had interrupts blocked.   The interrupt would be directed to SMM, the not very good baby virtual machine monitor, allowing performance monitoring to be done independent of the operating system. Very nice, when you don't have source code for the operating system. And so on. I can't remember, but it is possible that the interrupt could be directed to other processors other than the local processor.  However, that would not subsume the externally visible pin control, because the hardware pin can be a lot less expensive a lot more precise than signaling and enter processor interrupt.

I used a similar approach for machine check interrupts, which could also be directed to the operating system, NMI, SMM, hypervisor,…

By the way: I think Greg Favor said that x86 is performance monitoring interrupts are  level sensitive. That is not strictly speaking true: whether they are level sensitive or not is programmed into the AIPAC local vector table. You can make it all sensitive or edge triggered.

Obviously, however, when there are multiple performance counters bearing the same interrupt, you need to know which counter overflowed. Hence the sticky bits that Greg noticed in the manuals.


---+ Fancier stuff

The above is mostly about getting the most out of simple performance counters:  providing filter logic so that you can get the most insight out of the limited number of events;
providing enables in a central place so that you can freeze multiple counters at the same time;  allowing the performance counter interrupts to be directed not just a different privilege levels but in different interrupt priorities including NMI,  and possibly also external hardware.

There's a lot more stuff that can be done to help performance monitoring. Unfortunately, I have always worked at a place where I had to reduce  the performance monitoring hardware cost as much as possible.   I am sure, however, that many of you are familiar with fancier performance monitoring features, such as

+ Histogram counting (allowing you to count distributions without making multiple runs)
    => the CMASK comparators allowing very simple form of this, assuming you have enough performance counters. Actual histogram counters can do this more cheaply.

+ Cycle attribution - defining performance events so that you can actually say things like  "X% of cycles are spent waitying for memory".   

IMHO the single most important "advanced"  performance monitoring feature is what I call "longitudinal profiling".   AMD Instruction Based Sampling (IBS), DEC ProfileMe, ARM SPE (Statistical Profiling Extension).  The basic idea is to set a bit on some randomly selected instruction package somewhere high up in the pipeline, e.g. yet instruction fetch, and then let that bit flow down the pipeline, sampling things as it goes. E.g. you might sample the past missed latency or address,  or whether it produced a stall in interaction with a different marked instruction. This sort of profiling is quite expensive, e.g. requiring a bit in many places in the pipeline, as well as registers to record the sample data, but it provides a lot of insight: it can give you distributions and averages, it can tell you what interactions between instructions are causing problems.

However, if RISC-V cannot yet afford to do longitudinal profiling, the performance counter filter logic that I described above is low-hanging fruit, and much cheaper.




From: Alankao <alankao@...>
Sent: Monday, July 20, 2020 5:43PM
To: Tech-Privileged <tech-privileged@...>
Subject: Re: [RISC-V] [tech-privileged] A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

 

On 7/20/2020 5:43 PM, alankao wrote:
It was just so poorly rendered in my mail client, so please forgive my spam.

Hi Brian,

> I have been working on a similar proposal myself, with overflow, interrupts, masking, and delegation. One of the key differences in my proposal is that it unifies
> each counter's configuration control into a per-counter register, by using mhpmevent* but with some fields reserved/assigned a meaning.  <elaborating>

Thanks for sharing your experience and the elaboration. The overloading-hpmevent idea looks like the one in the SBI PMU extension threads in the Unix Platform Spec TG by Greg. I have a bunch of questions.  What became of your proposal? Was it discussed in public? Did you manage to implement your idea into a working HW/S-mode SW/U-mode SW solution? If so, we can compete with each other by benchmarking for real: the LoC of the perf patch (assuming you do it on Linux) and the system overhead of running a long perf sample.

> Another potential discussion point is, does overflow happen at 0x7fffffffffffffff -> 0x8000000000000000, or at 0xffffffffffffffff -> 0x0000000000000000? I have a
> bias towards the former so that even after overflow, the count is wholly contained in an XLEN-wide register treated as an unsigned number and accessible via
> a single read, which makes arithmetic convenient, but I know some people prefer to (or are used to?) have the overflow bit as a 33rd or 65th bit in a different
> register.

I have no bias here as long as the HPM interrupt can be triggered. But somehow it seems to me that you assume the HPM registers are XLEN wide, but actually they are not (yet?).  The spec says they should be 64 bits wide, although obviously nobody implements nor remembers that.

> Lastly, a feature I have enjoyed using in the past (on another ISA) is the concept of a 'marked' bit in the mstatus register. ... This is of course a bit intrusive in
> the architecture, as it requires adding a bit to mstatus, but the rest of the kernel just needs to save and restore this bit on context switches, without knowing its
> purpose.

Which architecture/OS are you referring to here? 

Through this discussion, we will understand which idea the community prefers: adding CSRs, overloading existing hpmevents, or some balanced compromise.  I believe the ultimate goal of this thread should be determining what the RISC-V HPM should really be like.

Best,
Alan

I apologize for some of the language errors that occur far too frequently in my email. I use speech recognition much of the time, and far too often do not catch misrecognition errors. This can be quite embarrassing, amusing, and/or confusing. Typical errors are not spelling but homonyms, words that sound the same - e.g. "cash" instead of "cache".


Re: Caching and sfence'ing (or not) of satp Bare mode "translations"

Greg Favor
 

On Tue, Jul 21, 2020 at 6:09 AM Jonathan Behrens <behrensj@...> wrote:
My understanding is that sfence.vma's are never required by the RISC-V spec, only that failing to do them can cause undesirable but well defined behavior.

The preceding is true, but your following paragraph isn't quite true.  In the current architecture spec one isn't free to hijack or reuse ASID=0 in the way I think you are describing.  It might be nice if that was allowed, but I don't believe it is.  As far as trying to change the arch spec to allow something like this, I'm not pushing for that (and expect that would meet a lot of resistance).

Greg
 

I'd suggest that the same be true here. We could consider the Bare mode to reuse ASID=0, and therefore software would have to do a sfence.vma only if there were stale mappings in that ASID that it didn't want to be used. My (completely uninformed) guess is that it shouldn't be too difficult for hardware to ignore global mappings when in Bare mode, but if people think otherwise, then the spec could say those also need to be flushed.

Overall I agree that this case doesn't need to be fast, but still should be consistent with how RISC-V does things in other places. And if it is possible to make TLB flushes restricted to a single ASID rather than global across all of them, then I think it makes sense to try to achieve that.

Jonathan

On Mon, Jul 20, 2020 at 11:58 PM Greg Favor via lists.riscv.org <gfavor=ventanamicro.com@...> wrote:
Comments below:

On Mon, Jul 20, 2020 at 8:14 PM Bill Huffman <huffman@...> wrote:

Hi Greg,

My sense is that the transitions from SvXX to Bare and from Bare to the same SvXX that was previously in force are special transitions.  One reason seems to me the extreme simplicity of Bare compared with other modes.  It's easier to switch.

Switches to/from Bare mode should be rare.  Typically one will switch from Bare mode to a translated mode as part of booting up an OS (e.g. Linux), and then will remain in that mode (until, say, the system crashes and must reboot).  Further, all such switches are performed under full software control.

Switching to/from M-mode on the other hand is frequent and often hardware initiated.  Also, any sfence.vma on M-mode exit would have to be after the exit (in potentially arbitrary code that happens to be at the return address).

Hence sfence'ing M-mode entry/exit is impractical as well as something that needs to be performant.   Whereas Bare mode entry/exit is rare and software-managed.

If we require sfence.vma after a switch to or from Bare, does that also mean we have to require one after a switch to or from M-mode?  If no, why is it different? 

If a high-performance design caches "translations" in all modes of operation (including M-mode) in the TLBs, then M-mode translations must be distinguished from S/HS/U mode translations, which must be distinguished from VS/VU mode translations.  That is a small set of three translation regimes (to use an ARMv8 term) for hardware to support and handle properly.

If one has to also distinguish Bare and non-Bare modes within the S/HS/U translation regime, that effectively becomes two separate translation regimes.  Similarly, with the H-extension and two-stage translations inside VMs, the VS/VU regime needs to become four regimes (the four combinations of Bare and non-Bare stage 1 and stage 2 modes).  Consequently TLB entries and surrounding logic now need to distinguish between and properly handle seven translation regimes.  All to handle rare cases.
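Enumerating those explicitly, the tag that every TLB entry would need to carry looks like this (a sketch; the names are mine):

    /* Translation regimes a TLB entry would have to distinguish if Bare
     * and non-Bare were kept architecturally separate (hypothetical). */
    enum regime {
        REGIME_M,             /* M-mode "translations" (PMA/PMP info only) */
        REGIME_S_SVXX,        /* S/HS/U, satp.mode = SvXX                  */
        REGIME_S_BARE,        /* S/HS/U, satp.mode = Bare                  */
        REGIME_VS_S1_S2,      /* VS/VU: stage 1 SvXX, stage 2 SvXX         */
        REGIME_VS_S1_BARE,    /* VS/VU: stage 1 Bare, stage 2 SvXX         */
        REGIME_VS_S2_BARE,    /* VS/VU: stage 1 SvXX, stage 2 Bare         */
        REGIME_VS_BARE_BARE,  /* VS/VU: stage 1 Bare, stage 2 Bare         */
    };  /* seven regimes, versus three if sfence.vma is simply required */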

That, like most things, is doable, but isn't the whole point of a RISC architecture to reduce hardware cost and complexity and shift that to software where the software burden is small and the performance cost is minimal?

Greg

P.S. One could imagine instead doing a data-dependent implicit sfence.vma operation on CSR writes to the *atp registers, but besides being data-dependent (which RISC-V avoids when it comes to having data-dependent exceptions) that is a rather CISC'y thing to do.  Which goes back to my preceding point.
 

If yes, it will cost more to switch briefly to M-mode than I'd want it to.

      Bill

On 7/20/20 7:08 PM, Greg Favor wrote:

I would like to get people's views on the question of when an sfence.vma is required after changing the satp.mode field (to see what support there is for the following change/clarification in the Privileged spec).

Currently an sfence.vma is required when changing between SvXX modes, but is not required when changing to/from Bare mode.

In both cases there is an implicit wholesale change to the page tables, i.e. the translation of any given address generally has suddenly changed.

For some designs (that cache Bare mode translations in the TLBs for the sake of caching the PMA and PMP info for an address), having software be required to do an sfence.vma can simplify the hardware.

So the question is whether there should be architectural consistency in requiring sfence'ing after changing satp.mode (i.e. all mode changes require an sfence), versus having some mode cases being treated differently (i.e. changes to/from Bare mode not requiring an sfence)?

My (and Ventana's) bias is towards the former - for our sake and for other future high performance CPU designs by others.  But I'm interested to see if others feel similarly or not.

Greg


Re: A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

Bill Huffman
 


On 7/21/20 11:56 AM, Greg Favor wrote:

What I ultimately understood from Brian's emails is that this 'marked' bit conceptually is not a state bit associated with each counter, but a state bit maintained separately by software (as it transitions into and out of regions of code that it wants to be viewed as "marked" or not).  Then counters can be configured (via their event selection) to count either marked events or unmarked events or both.

With that view, the 'marked' bit wants to be changeable directly by the software without having to call into M-mode and without those bits of software having to be aware of which counters were configured to count marked events, or non-marked events, or both.  This is why one doesn't want to be trying to use mcountinhibit bits.
I agree about calling M-mode.  But I think the software that swaps entire mcountinhibit registers doesn't have to know anything about the bits in them either.

In any case, I'm just trying to represent what I understand to be Brian's request.  But as he also acknowledged, the primary use case is probably bare-metal embedded systems and may not be more generally relevant.

As far as the 'Active' bit in my proposal, it allows for both software and hardware events to set/clear a counter's Active bit.  That's more interesting when one has hardware cross-trigger events from the debug Trigger Module and from specific counter overflows.  But since this all wants to go together, I would be fine with removing the Active bit from my proposal.  This simple set of cross-trigger capabilities (to create rich events that can be counted and to create richer debug triggers and richer trace control) is best treated as a separate (future) extension proposal that may or may not catch enough people's interest.  (We're all for standardizing these richer capabilities in some form, but if that doesn't happen then we (Ventana) will implement this as our own custom stuff.)

Greg

I'm not trying to argue for or against any particular thing - including the Active bit - at this point.  Just wanting a full understanding....

      Bill


On Tue, Jul 21, 2020 at 11:08 AM Bill Huffman <huffman@...> wrote:

Brian,

I think the advantages you explain are exactly the reason why I was asking whether the marked bit was a significant improvement.  I think the functionality is extremely similar to swapping mcountinhibit CSR values.  For example, there can be one counter that counts marked activities and another that counts non-marked activities.  This would be done by having two counters set the same ways except that one inhibit bit was set in one mcountinhibit value and the other was set in the other.

It seems to me that the advantage of the marked bit is that there isn't a register to swap.  But the bit in the status register still needs to be changed at those same times.  So, I'm seeing only tiny differences - such as that, with the marked bit, there's no need for a location to save a mcountinhibit value when it's not being used.  And presumably supervisor can change the marked bit in status.  On the other hand, the mcountinhibit method allows the effect of more than two values of the marked bit by having more than two mcountinhibit values.

So, it doesn't seem to me a very large improvement once mcountinhibit already exists, even for the embedded cases where context-switching of perfmon registers is not done.

      Bill


On 7/20/20 10:25 PM, Brian Grayson wrote:

The 'marked bit' in my proposal is different from a per-counter active bit, and may be a bit hard to explain well, but I'll try.

To me, there are two very different performance monitor approaches for system software:

- Linux-style, where one uses a tool like perf, and perfmon state is saved and restored on context switches. In this case, a 'marked bit' is not needed, as one can just control everything on a per-process basis

- embedded bare-metal whole-system performance monitoring, where there is no support like perf. This is where the 'marked bit' becomes more obvious, where basically one configures the counters to be free-running, except for the masking. So for example, consider an embedded application where there is the bread-and-butter ordinary work, but there is also some kind of exceptional/unordinary work (bad packet checksum, new route to establish, network link up/down, etc as networking examples). One could set and clear the marked bit on entry and exit from these routines, allowing easy profiling of everything, or of just the ordinary work, or of just the exceptional work, by using the marked bit. The same could be done by having each entry/exit point reprogram N counters, or altering a global mcountinhibit, but both of those approaches fall short. The first one forces a recompile of your system software whenever you want to change events, or a swapin/swapout of perfmon state (just like a context switch) when entering/leaving these routines, while the marked bit just requires setting a single bit; the second one (using mcountinhibit) forces one to choose between counting or not counting, and doesn't allow you to count marked-bit activities on one counter, and non-marked-bit activities on another counter, and all activities (both marked and unmarked) on a third counter.
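A sketch of that entry/exit usage, assuming a hypothetical MSTATUS_MARKED bit position (the proposal doesn't fix one):

    #include <stdint.h>

    #define MSTATUS_MARKED (1UL << 40)   /* hypothetical bit position */

    static inline void set_marked(void)
    {
        __asm__ volatile("csrs mstatus, %0" :: "r"(MSTATUS_MARKED));
    }

    static inline void clear_marked(void)
    {
        __asm__ volatile("csrc mstatus, %0" :: "r"(MSTATUS_MARKED));
    }

    /* Exceptional-path routine: everything in here is counted by the
     * counters configured for "marked" events, excluded from the counters
     * configured for "unmarked" events, and still seen by any counter
     * configured to count both. */
    void handle_bad_checksum(void)
    {
        set_marked();
        /* ... recovery work ... */
        clear_marked();
    }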

If all of the embedded customers will be using a perf-style approach with context-switching of perfmon registers, I think I can agree that the marked bit is not as useful, but I don't think that will be true for all of our customers.

Brian

On Mon, Jul 20, 2020 at 11:30 PM Greg Favor <gfavor@...> wrote:
Bill,

Hopefully my last email also answers your question.

Greg

On Mon, Jul 20, 2020 at 8:04 PM Bill Huffman <huffman@...> wrote:

Brian,

I'm curious whether the 'marked' bit is a significant improvement on the mcountinhibit CSR, which could be swapped to enable counters for multiple processes.

      Bill

On 7/20/20 1:54 PM, Brian Grayson wrote:

I have been working on a similar proposal myself, with overflow, interrupts, masking, and delegation. One of the key differences in my proposal is that it unifies each counter's configuration control into a per-counter register, by using mhpmevent* but with some fields reserved/assigned a meaning.

In your proposal, the control for counter3, for example, is in mcounteren (for interrupt enablement), mcounterovf (for overflow status), mcountermask (for which modes to count), and of course mhpmevent3. In my proposal, all of these would be contained in the existing per-counter configuration register, mhpmevent*. There is not much of a difference between these proposals, except that in your proposal, programming a counter may require read-modify-write'ing 5 registers (those 4 control plus the counter itself), whereas in my proposal it requires writing (and not read-modify-write'ing) only two registers. In practice, programming 4 counters would require 20 RMW's (20 csrr's and 20 csrw's and some glue ALU instructions in the general case -- some of those would be avoidable of course) in your proposal vs. 8 simple writes in mine -- four to set the event config, four to set/clear the counter registers. (I am a big fan of low-overhead perfmon interactions, to reduce the impact on the software-under-test.) With a shadow-CSR approach, both methods could even be supported on a single implementation, since it is really just a matter of where the bits are stored and how they are accessed, and not one of functionality.

From my count, roughly 8 or so bits would be needed in mhpmevent* to specify interrupt enable, masking mode, et al. These could be allocated from the top bits, allowing 32+ bits for ordinary event specification.
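For illustration, here is roughly what programming one counter looks like under this approach; the field positions below are hypothetical placeholders, not a proposed encoding:

    #include <stdint.h>

    /* Hypothetical mhpmevent* layout: low bits select the event, top
     * bits carry the per-counter control fields. */
    #define HPMEVT_OF_INT_EN (1UL << 62)  /* overflow interrupt enable */
    #define HPMEVT_COUNT_U   (1UL << 61)  /* count in U-mode           */
    #define HPMEVT_COUNT_S   (1UL << 60)  /* count in S-mode           */

    /* Two plain CSR writes per counter: no read-modify-write needed. */
    static inline void program_counter3(uint64_t event)
    {
        uint64_t cfg = event | HPMEVT_OF_INT_EN |
                       HPMEVT_COUNT_U | HPMEVT_COUNT_S;
        __asm__ volatile("csrw mhpmevent3, %0"   :: "r"(cfg));
        __asm__ volatile("csrw mhpmcounter3, %0" :: "r"(0UL));
    }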

Another potential discussion point is, does overflow happen at 0x7fffffffffffffff -> 0x8000000000000000, or at 0xffffffffffffffff -> 0x0000000000000000? I have a bias towards the former so that even after overflow, the count is wholly contained in an XLEN-wide register treated as an unsigned number and accessible via a single read, which makes arithmetic convenient, but I know some people prefer to (or are used to?) have the overflow bit as a 33rd or 65th bit in a different register.
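A quick sketch of why I find the former convenient (plain C arithmetic; the only assumption is a 64-bit counter):

    #include <stdint.h>

    #define OVERFLOW_POINT 0x8000000000000000ULL  /* 0x7fff..f -> 0x8000..0 */

    /* Preload so the overflow interrupt fires after "period" more events. */
    uint64_t preload(uint64_t period)
    {
        return OVERFLOW_POINT - period;
    }

    /* After the interrupt, the events counted since preload are recovered
     * with one unsigned CSR read and one subtraction; the whole count is
     * still in the register, and no 65th bit lives elsewhere. */
    uint64_t events_since_preload(uint64_t counter_now, uint64_t period)
    {
        return counter_now - preload(period);
    }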

Lastly, a feature I have enjoyed using in the past (on another ISA) is the concept of a 'marked' bit in the mstatus register. If one allows masked counting based on whether this bit is set or clear, it allows additional filtering on when to count. For example, in a system with multiple processes, one set of processes could have the marked bit set, and the remaining processes could keep the marked bit clear. One can then count "total instructions in marked processes" and "total instructions in unmarked processes" in a single run on two different counters. In complicated systems, this can be enormously helpful. This is of course a bit intrusive in the architecture, as it requires adding a bit to mstatus, but the rest of the kernel just needs to save and restore this bit on context switches, without knowing its purpose.

As you point out, interrupts and overflows are the key show-stoppers with the current performance monitor architecture, and those are the most important to get added.

Brian

On Sun, Jul 19, 2020 at 10:47 PM alankao <alankao@...> wrote:
Hi all,

This proposal is a refinement and update of a previous thread: https://lists.riscv.org/g/tech-privileged-archive/message/488.  We noticed the current activities regarding HPM in Linux community and in Unix Platform Spec Task Group, so again we would like to call for attention to the fundamental problems of RISC-V HPM.  In brief, we plan to add 6 new CSRs to address these problems.  Please check https://github.com/NonerKao/PMU_enhancement_proposal/blob/master/proposal.md for a simple perf example and the details.

Thanks!


Re: A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

Greg Favor
 

What I ultimately understood from Brian's emails is that this 'marked' bit conceptually is not a state bit associated with each counter, but a state bit maintained separately by software (as it transitions into and out of regions of code that it wants to be viewed as "marked" or not).  Then counters can be configured (via their event selection) to count either marked events or unmarked events or both.

With that view, the 'marked' bit wants to be changeable directly by the software without having to call into M-mode and without those bits of software having to be aware of which counters were configured to count marked events, or non-marked events, or both.  This is why one doesn't want to be trying to use mcountinhibit bits.

In any case, I'm just trying to represent what I understand to be Brian's request.  But as he also acknowledged, the primary use case is probably bare-metal embedded systems and may not be more generally relevant.

As far as the 'Active' bit in my proposal, it allows for both software and hardware events to set/clear a counter's Active bit.  That's more interesting when one has hardware cross-trigger events from the debug Trigger Module and from specific counter overflows.  But since this all wants to go together, I would be fine with removing the Active bit from my proposal.  This simple set of cross-trigger capabilities (to create rich events that can be counted and to create richer debug triggers and richer trace control) is best treated as a separate (future) extension proposal that may or may not catch enough people's interest.  (We're all for standardizing these richer capabilities in some form, but if that doesn't happen then we (Ventana) will implement this as our own custom stuff.)

Greg




Re: A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

Bill Huffman
 

Brian,

I think the advantages you explain are exactly the reason why I was asking whether the marked bit was a significant improvement.  I think the functionality is extremely similar to swapping mcountinhibit CSR values.  For example, there can be one counter that counts marked activities and another that counts non-marked activities.  This would be done by having two counters set the same ways except that one inhibit bit was set in one mcountinhibit value and the other was set in the other.

It seems to me that the advantage of the marked bit is that there isn't a register to swap.  But the bit in the status register still needs to be changed at those same times.  So, I'm seeing only tiny differences - such as that, with the marked bit, there's no need for a location to save a mcountinhibit value when it's not being used.  And presumably supervisor can change the marked bit in status.  On the other hand, the mcountinhibit method allows the effect of more than two values of the marked bit by having more than two mcountinhibit values.

So, it doesn't seem to me a very large improvement once mcountinhibit already exists, even for the embedded cases where context-switching of perfmon registers is not done.

      Bill




Re: Caching and sfence'ing (or not) of satp Bare mode "translations"

Jonathan Behrens <behrensj@...>
 

My understanding is that sfence.vma's are never required by the RISC-V spec, only that failing to do them can cause undesirable but well defined behavior.

I'd suggest that the same be true here. We could consider the Bare mode to reuse ASID=0, and therefore software would have to do a sfence.vma only if there were stale mappings in that ASID that it didn't want to be used. My (completely uninformed) guess is that it shouldn't be too difficult for hardware to ignore global mappings when in Bare mode, but if people think otherwise, then the spec could say those also need to be flushed.

Overall I agree that this case doesn't need to be fast, but still should be consistent with how RISC-V does things in other places. And if it is possible to make TLB flushes restricted to a single ASID rather than global across all of them, then I think it makes sense to try to achieve that.
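For reference, the per-ASID flush is already a single instruction; a sketch in C with inline assembly, assuming software dedicates ASID 0 to Bare mode as suggested above:

    #include <stdint.h>

    /* Flush all non-global TLB entries belonging to one ASID.
     * rs1 = x0 means "all virtual addresses"; rs2 carries the ASID. */
    static inline void sfence_vma_asid(uint64_t asid)
    {
        __asm__ volatile("sfence.vma x0, %0" :: "r"(asid) : "memory");
    }

    /* Software reusing ASID 0 for Bare mode would flush just that ASID
     * before switching, leaving all other ASIDs' entries intact. */
    void flush_bare_asid(void)
    {
        sfence_vma_asid(0);
    }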

Jonathan




Re: A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

Anup Patel
 

Hi Greg,

 

For per-HART edge-sensitive interrupts, we can avoid the overflow status bit for each counter by keeping track of the last read value of each counter and comparing against it in the overflow interrupt handler.
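A sketch of that detection scheme in C; hpm_read() stands in for however the handler reads a given hpmcounter:

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_COUNTERS 29
    static uint64_t last_read[NUM_COUNTERS];

    extern uint64_t hpm_read(int idx);   /* assumed counter-read helper */

    /* In the per-HART overflow handler: a counter that is now below its
     * last observed value must have wrapped since we last looked, so no
     * per-counter architectural overflow status bit is needed. */
    bool counter_overflowed(int idx)
    {
        uint64_t now = hpm_read(idx);
        bool wrapped = now < last_read[idx];
        last_read[idx] = now;
        return wrapped;
    }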

 

I am suggesting one edge-sensitive interrupt for each HART routed through PLIC so that we don’t need too many PLIC interrupt lines.

 

Regards,

Anup

 

From: Greg Favor <gfavor@...>
Sent: 21 July 2020 11:36
To: Anup Patel <Anup.Patel@...>
Cc: Brian Grayson <brian.grayson@...>; alankao <alankao@...>; tech-privileged@...; andrew@...
Subject: Re: [RISC-V] [tech-privileged] A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

 

Regarding overflow interrupts as edge-sensitive interrupts:

 

It seems like this would require that there not be any overflow status bit in mhpmevent CSRs (or any alternative CSR), otherwise this bit would need to be cleared by software - which is equivalent to the "clearing a serviced overflow interrupt" that is trying to be avoided below.  Which seems generally undesirable; as well as how would the overflow interrupt handler figure out which counter(s) overflowed?  (Or are you imagining 32 separate per-counter overflow interrupt requests to a PLIC?)

 

In contrast, both x86 and ARMv8 have explicit counter overflow status bits that are the basis for generating a shared level-sensitive interrupt request and these bits must be cleared by handler software.

 

Lastly, note that in my proposal the handler (say in S/HS mode) can directly clear the Overflow status bits for the counter overflows that it has serviced - via the hpmoverflow CSR.  That avoids the SBI call that you rightly are wanting to avoid.

 

Greg

 

On Mon, Jul 20, 2020 at 10:39 PM Anup Patel <anup.patel@...> wrote:

Hi All,

 

The proposed SBI PMU extension is a set of APIs between M-mode and HS-mode (also between HS-mode and VS-mode) such that no RISC-V spec changes are required for existing HPMCOUNTERs.

 

The high-level view of SBI PMU extension is as follows:

  1. The SBI PMU calls are only for discovering and configuring HARDWARE/SOFTWARE counters and events
  2. The HARDWARE/SOFTWARE counters will be read directly from S-mode software without any SBI calls
  3. There are two suggested approaches for overflow handling:
    1. No overflow interrupt. The S-mode software will use periodic timer interrupts to track counter overflows. This will be an imprecise software approach for detecting overflow (see the sketch after this list)
    2. Per-HART edge-triggered interrupt routed through the PLIC (or some other platform interrupt controller). The per-HART interrupt will have to be routed through the PLIC because we don't have interrupts defined in the MIP/MIE CSRs for counter overflows. Making these interrupts edge-triggered will not require S-mode software to clear the interrupt using an SBI call
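A sketch of approach 1, the timer-based software fallback, in C; hpm_read() is an assumed counter-read helper and poll_overflows() would be called from the S-mode timer tick:

    #include <stdint.h>

    #define NUM_COUNTERS 29
    static uint64_t last[NUM_COUNTERS];
    static uint64_t wraps[NUM_COUNTERS];   /* software overflow count */

    extern uint64_t hpm_read(int idx);     /* assumed helper */

    /* As long as the tick period is shorter than the fastest possible
     * wrap time, each wrap is observed exactly once; precision is
     * limited to the tick granularity, hence "imprecise". */
    void poll_overflows(void)
    {
        for (int i = 0; i < NUM_COUNTERS; i++) {
            uint64_t now = hpm_read(i);
            if (now < last[i])
                wraps[i]++;                /* counter wrapped since last tick */
            last[i] = now;
        }
    }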

 

The SBI PMU extension does not prevent anyone from defining a new RISC-V PMU extension. However, whatever new RISC-V PMU extension is defined should consider the H-extension and also provide dedicated CSRs for S-mode.

 

In future, when a new RISC-V PMU extension is available in the RISC-V privileged spec, the SBI PMU extension will continue to exist and will be mostly used for SOFTWARE counters provided by the SBI implementation (OpenSBI or hypervisors) to the S-mode software.

 

Regards,

Anup

 

From: tech-privileged@... <tech-privileged@...> On Behalf Of Brian Grayson
Sent: 21 July 2020 10:45
To: alankao <alankao@...>
Cc: tech-privileged@...
Subject: Re: [RISC-V] [tech-privileged] A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

 

Hi, Alan.

 

My proposal is still a work in progress, hence has not been shared publicly, but is significantly based on a proven architecture with about 30 years in the field and a few billion shipping cores, if not more -- the PowerPC performance monitor implementation. I did the in-house Linux kernel patches and tool support for it about two decades ago at Motorola :) so I used to know it quite well, and can see how a similar approach solves some of the current problems that we all have encountered with the current RISC-V approach. I am fairly new to the RISC-V ecosystem, so I was not aware of the work that you have done in the past; thanks for the pointer to that.

 

The SBI PMU extension is more about the API between what perf (or another tool) communicates, and how the M-mode software interprets it, and not about actually changing the hardware interpretation of mhpmevent bits, at least that was my understanding.

 

I am glad that so many of us are converging on all the same fundamental needs!

 

Brian

 



Re: A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

Greg Favor
 

Regarding overflow interrupts as edge-sensitive interrupts:

It seems like this would require that there not be any overflow status bit in mhpmevent CSRs (or any alternative CSR), otherwise this bit would need to be cleared by software - which is equivalent to the "clearing a serviced overflow interrupt" that is trying to be avoided below.  Which seems generally undesirable; as well as how would the overflow interrupt handler figure out which counter(s) overflowed?  (Or are you imagining 32 separate per-counter overflow interrupt requests to a PLIC?)

In contrast, both x86 and ARMv8 have explicit counter overflow status bits that are the basis for generating a shared level-sensitive interrupt request and these bits must be cleared by handler software.

Lastly, note that in my proposal the handler (say in S/HS mode) can directly clear the Overflow status bits for the counter overflows that it has serviced - via the hpmoverflow CSR.  That avoids the SBI call that you rightly are wanting to avoid.
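A sketch of that handler flow in C; the hpmoverflow CSR number used here (0x5C0) is a placeholder, not a proposed assignment:

    #include <stdint.h>

    /* S/HS-mode overflow interrupt handler: find which counters
     * overflowed and clear their sticky status bits directly, with no
     * SBI call into M-mode. */
    void hpm_overflow_handler(void)
    {
        uint64_t pending;

        /* Read the per-counter overflow status bits (hypothetical CSR). */
        __asm__ volatile("csrr %0, 0x5C0" : "=r"(pending));

        for (int i = 3; i < 32; i++) {
            if (pending & (1UL << i)) {
                /* ... service counter i: record sample, re-arm period ... */
            }
        }

        /* Clear only the serviced bits; the shared level-sensitive
         * interrupt request then drops. */
        __asm__ volatile("csrc 0x5C0, %0" :: "r"(pending));
    }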

Greg



Re: A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

Greg Favor
 

Ah, I see.  The 'marked' bit is state associated with and managed by the code running, not associated with a counter.  Then a counter could be configured (via its event selection) to count a selected event while the current 'marked' bit is set or not set.

As you note, for everyone using a perf-style approach, this 'marked' bit is not so useful.  And for the other bare-metal embedded customers that might desire this 'marked' bit, this bit of state needs to be added to some other existing or new CSR that is distinct from the current hpmcounter/mhpmevent CSR's.  That sounds like a separate (small) extension, orthogonal to the current discussion, targeted at this embedded segment of people.

Greg



Re: A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

Allen Baum
 

It's late; I'm missing something.
How is "mark" being set/cleared on entry/exit different than "active" in Greg's proposal being set/cleared on entry/exit ?

On Mon, Jul 20, 2020 at 10:26 PM Brian Grayson <brian.grayson@...> wrote:
[snip: Brian's 'marked bit' message, quoted in full just above]

On Mon, Jul 20, 2020 at 11:30 PM Greg Favor <gfavor@...> wrote:
Bill,

Hopefully my last email also answers your question.

Greg

On Mon, Jul 20, 2020 at 8:04 PM Bill Huffman <huffman@...> wrote:

Brian,

I'm curious whether the 'marked' bit is a significant improvement on the mcountinhibit CSR, which could be swapped to enable counters for multiple processes.

      Bill

On 7/20/20 1:54 PM, Brian Grayson wrote:

I have been working on a similar proposal myself, with overflow, interrupts, masking, and delegation. One of the key differences in my proposal is that it unifies each counter's configuration control into a per-counter register, by using mhpmevent* but with some fields reserved/assigned a meaning.

In your proposal, the control for counter3, for example, is in mcounteren (for interrupt enablement), mcounterovf (for overflow status), mcountermask (for which modes to count), and of course mhpmevent3. In my proposal, all of these would be contained in the existing per-counter configuration register, mhpmevent*. There is not much of a difference between these proposals, except that in your proposal, programming a counter may require read-modify-write'ing 5 registers (those 4 control plus the counter itself), whereas in my proposal it requires writing (and not read-modify-write'ing) only two registers. In practice, programming 4 counters would require 20 RMW's (20 csrr's and 20 csrw's and some glue ALU instructions in the general case -- some of those would be avoidable of course) in your proposal vs. 8 simple writes in mine -- four to set the event config, four to set/clear the counter registers. (I am a big fan of low-overhead perfmon interactions, to reduce the impact on the software-under-test.) With a shadow-CSR approach, both methods could even be supported on a single implementation, since it is really just a matter of where the bits are stored and how they are accessed, and not one of functionality.
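To make the cost difference concrete, here is a sketch of the unified style, assuming RV64 and purely illustrative control-bit positions (none of these mhpmevent bits are architectural; they are just the kind of fields such a proposal would carve out of the top bits):

    #define HPMEVT_EVENT(x)  ((unsigned long)(x) & 0xffffffffUL) /* low 32 bits: event select */
    #define HPMEVT_OF_INTEN  (1UL << 62)   /* overflow-interrupt enable (illustrative) */
    #define HPMEVT_S_EN      (1UL << 60)   /* count in S-mode (illustrative) */
    #define HPMEVT_U_EN      (1UL << 59)   /* count in U-mode (illustrative) */

    static inline void program_counter3(unsigned long event_sel)
    {
        unsigned long cfg = HPMEVT_EVENT(event_sel)
                          | HPMEVT_OF_INTEN | HPMEVT_S_EN | HPMEVT_U_EN;
        /* Two plain csrw's per counter; no read-modify-write anywhere. */
        __asm__ volatile ("csrw mhpmevent3,   %0" :: "r"(cfg));
        __asm__ volatile ("csrw mhpmcounter3, %0" :: "r"(0UL));
    }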

By my count, roughly 8 or so bits would be needed in mhpmevent* to specify interrupt enable, masking mode, et al. These could be allocated from the top bits, leaving 32+ bits for ordinary event specification.

Another potential discussion point is, does overflow happen at 0x7fffffffffffffff -> 0x8000000000000000, or at 0xffffffffffffffff -> 0x0000000000000000? I have a bias towards the former so that even after overflow, the count is wholly contained in an XLEN-wide register treated as an unsigned number and accessible via a single read, which makes arithmetic convenient, but I know some people prefer to (or are used to?) have the overflow bit as a 33rd or 65th bit in a different register.
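To illustrate why the former is convenient, here is a sampling sketch under that convention, assuming RV64 and assuming the counter keeps counting through the MSB flip:

    #include <stdint.h>

    #define SAMPLE_PERIOD 100000ULL   /* events per sample, illustrative */

    /* Preload so the 0x7fff...f -> 0x8000...0 transition, i.e. the
       overflow interrupt under this convention, fires after
       SAMPLE_PERIOD events. */
    static inline void arm_counter3(void)
    {
        uint64_t start = 0x8000000000000000ULL - SAMPLE_PERIOD;
        __asm__ volatile ("csrw mhpmcounter3, %0" :: "r"(start));
    }

    /* The register still reads as one coherent unsigned number after
       the overflow, so recovering the count is a single subtract; no
       separate overflow-bit register has to be read and stitched in. */
    static inline uint64_t events_since_arm(void)
    {
        uint64_t now;
        __asm__ volatile ("csrr %0, mhpmcounter3" : "=r"(now));
        return now - (0x8000000000000000ULL - SAMPLE_PERIOD);
    }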

Lastly, a feature I have enjoyed using in the past (on another ISA) is the concept of a 'marked' bit in the mstatus register. If one allows masked counting based on whether this bit is set or clear, it allows additional filtering on when to count. For example, in a system with multiple processes, one set of processes could have the marked bit set, and the remaining processes could keep the marked bit clear. One can then count "total instructions in marked processes" and "total instructions in unmarked processes" in a single run on two different counters. In complicated systems, this can be enormously helpful. This is of course a bit intrusive in the architecture, as it requires adding a bit to mstatus, but the rest of the kernel just needs to save and restore this bit on context switches, without knowing its purpose.

As you point out, interrupts and overflows are the key show-stoppers with the current performance monitor architecture, and those are the most important to get added.

Brian

On Sun, Jul 19, 2020 at 10:47 PM alankao <alankao@...> wrote:
Hi all,

This proposal is a refinement and update of a previous thread: https://lists.riscv.org/g/tech-privileged-archive/message/488.  We noticed the current activities regarding HPM in the Linux community and in the Unix Platform Spec Task Group, so again we would like to call attention to the fundamental problems of RISC-V HPM.  In brief, we plan to add 6 new CSRs to address these problems.  Please check https://github.com/NonerKao/PMU_enhancement_proposal/blob/master/proposal.md for a simple perf example and the details.

Thanks!


Re: A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

Anup Patel
 

Hi All,

 

The proposed SBI PMU extension is a set of APIs between M-mode and HS-mode (also between HS-mode and VS-mode) such that no RISC-V spec changes are required for existing HPMCOUNTERs.

 

The high-level view of SBI PMU extension is as follows:

  1. The SBI PMU calls are only for discovering and configuring HARDWARE/SOFTWARE counters and events
  2. The HARDWARE/SOFTWARE counters will be read directly by S-mode software without any SBI calls
  3. There are two suggested approaches for overflow handling:
    1. No overflow interrupt. The S-mode software uses periodic timer interrupts to track counter overflows. This is an imprecise, software-only approach to detecting overflow
    2. A per-HART edge-triggered interrupt routed through the PLIC (or some other platform interrupt controller). The per-HART interrupt has to go through the PLIC because no interrupts are defined in the MIP/MIE CSRs for counter overflow. Making these interrupts edge-triggered means S-mode software does not need an SBI call to clear the interrupt
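A sketch of that flow from the S-mode side follows; the extension ID, function ID, and argument layout are placeholders invented for illustration, since the SBI PMU ABI was still being defined at this point:

    /* Placeholder IDs, not a ratified ABI. */
    #define SBI_EXT_PMU        0x504D55L   /* hypothetical extension ID */
    #define SBI_PMU_CFG_EVENT  0x2L        /* hypothetical function ID  */

    static long sbi_call2(long ext, long fid, long arg0, long arg1)
    {
        register long a0 __asm__("a0") = arg0;
        register long a1 __asm__("a1") = arg1;
        register long a6 __asm__("a6") = fid;
        register long a7 __asm__("a7") = ext;
        __asm__ volatile ("ecall"
                          : "+r"(a0), "+r"(a1)
                          : "r"(a6), "r"(a7)
                          : "memory");
        return a0;   /* error code; a1 would carry a return value */
    }

    /* Step 1: one SBI call binds counter 3 to an event; M-mode performs
       the actual mhpmevent3 write.
       e.g.: sbi_call2(SBI_EXT_PMU, SBI_PMU_CFG_EVENT, 3, event_id);
       Step 2: hot-path reads then need no SBI call at all. */
    static inline unsigned long read_hpmcounter3(void)
    {
        unsigned long v;
        __asm__ volatile ("csrr %0, hpmcounter3" : "=r"(v));
        return v;
    }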

 

The SBI PMU extension does not prevent anyone from defining a new RISC-V PMU extension.  However, any new RISC-V PMU extension that is defined should take the H-extension into account and also provide dedicated CSRs for S-mode.

 

In the future, when a new RISC-V PMU extension is available in the RISC-V privileged spec, the SBI PMU extension will continue to exist and will mostly be used for SOFTWARE counters provided by the SBI implementation (OpenSBI or hypervisors) to the S-mode software.

 

Regards,

Anup

 

From: tech-privileged@... <tech-privileged@...> On Behalf Of Brian Grayson
Sent: 21 July 2020 10:45
To: alankao <alankao@...>
Cc: tech-privileged@...
Subject: Re: [RISC-V] [tech-privileged] A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

 

Hi, Alan.

 

My proposal is still a work in progress, hence has not been shared publicly, but is significantly based on a proven architecture with about 30 years in the field and a few billion shipping cores, if not more -- the PowerPC performance monitor implementation. I did the in-house Linux kernel patches and tool support for it about two decades ago at Motorola :) so I used to know it quite well, and can see how a similar approach solves some of the current problems that we all have encountered with the current RISC-V approach. I am fairly new to the RISC-V ecosystem, so I was not aware of the work that you have done in the past; thanks for the pointer to that.

 

The SBI PMU extension is more about the API between what perf (or another tool) communicates and how the M-mode software interprets it, and not about actually changing the hardware interpretation of the mhpmevent bits -- at least that was my understanding.

 

I am glad that so many of us are converging on all the same fundamental needs!

 

Brian

 

On Mon, Jul 20, 2020 at 7:38 PM alankao <alankao@...> wrote:

[snip: Alan's reply to Brian, quoted in full later in this thread]


Re: A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

Brian Grayson
 

The 'marked bit' in my proposal is different from a per-counter active bit, and may be a bit hard to explain well, but I'll try.

To me, there are two very different performance monitor approaches for system software:

- Linux-style, where one uses a tool like perf, and perfmon state is saved and restored on context switches. In this case, a 'marked bit' is not needed, as one can just control everything on a per-process basis

- embedded bare-metal whole-system performance monitoring, where there is no support like perf. This is where the 'marked bit' becomes more obvious, where basically one configures the counters to be free-running, except for the masking. So for example, consider an embedded application where there is the bread-and-butter ordinary work, but there is also some kind of exceptional/unordinary work (bad packet checksum, new route to establish, network link up/down, etc as networking examples). One could set and clear the marked bit on entry and exit from these routines, allowing easy profiling of everything, or of just the ordinary work, or of just the exceptional work, by using the marked bit. The same could be done by having each entry/exit point reprogram N counters, or altering a global mcountinhibit, but both of those approaches fall short. The first one forces a recompile of your system software whenever you want to change events, or a swapin/swapout of perfmon state (just like a context switch) when entering/leaving these routines, while the marked bit just requires setting a single bit; the second one (using mcountinhibit) forces one to choose between counting or not counting, and doesn't allow you to count marked-bit activities on one counter, and non-marked-bit activities on another counter, and all activities (both marked and unmarked) on a third counter.

If all of the embedded customers will be using a perf-style approach with context-switching of perfmon registers, I think I can agree that the marked bit is not as useful, but I don't think that will be true for all of our customers.

Brian


[snip: quoted messages from Greg, Bill, Brian, and Alan, all reproduced in full earlier in this thread]


Re: A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

Brian Grayson
 

Hi, Alan.

My proposal is still a work in progress, hence has not been shared publicly, but is significantly based on a proven architecture with about 30 years in the field and a few billion shipping cores, if not more -- the PowerPC performance monitor implementation. I did the in-house Linux kernel patches and tool support for it about two decades ago at Motorola :) so I used to know it quite well, and can see how a similar approach solves some of the current problems that we all have encountered with the current RISC-V approach. I am fairly new to the RISC-V ecosystem, so I was not aware of the work that you have done in the past; thanks for the pointer to that.

The SBI PMU extension is more about the API between what perf (or another tool) communicates and how the M-mode software interprets it, and not about actually changing the hardware interpretation of the mhpmevent bits -- at least that was my understanding.

I am glad that so many of us are converging on all the same fundamental needs!

Brian

On Mon, Jul 20, 2020 at 7:38 PM alankao <alankao@...> wrote:
Hi Brian,

I have been working on a similar proposal myself, with overflow, interrupts, masking, and delegation. One of the key differences in my proposal is that it unifies each counter's configuration control into a per-counter register, by using mhpmevent* but with some fields reserved/assigned a meaning.  <elaborating>
Thanks for sharing your experience and the elaboration. The overloading-hpmevent idea looks like the one in the SBI PMU extension threads in the Unix Platform Spec TG by Greg. I have a bunch of questions.  What became of your proposal? Was it discussed in public? Did you manage to implement it as a working HW/S-mode SW/U-mode SW solution? If so, we can compare the two approaches by benchmarking the LoC of the perf patch (assuming you did it on Linux) and the system overhead of running a long perf sample.


Another potential discussion point is, does overflow happen at 0x7fffffffffffffff -> 0x8000000000000000, or at 0xffffffffffffffff -> 0x0000000000000000? I have a bias towards the former so that even after overflow, the count is wholly contained in an XLEN-wide register treated as an unsigned number and accessible via a single read, which makes arithmetic convenient, but I know some people prefer to (or are used to?) have the overflow bit as a 33rd or 65th bit in a different register.
I have no bias here as long as the HPM interrupt can be triggered. But it seems to me that you assume the HPM registers are XLEN bits wide, when in fact they are not (yet?).  The spec says they should be 64 bits wide, although apparently nobody implements or even remembers that.
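(For what it's worth, on RV32 the spec's answer to the width question is the high-half CSRs, hpmcounter3h and friends; a sketch of the usual carry-guarded 64-bit read:)

    /* RV32: read a 64-bit hpmcounter coherently across its two halves;
       re-read the high half to detect a carry between the two reads. */
    static inline unsigned long long read_hpmcounter3_64(void)
    {
        unsigned long hi, lo, hi2;
        do {
            __asm__ volatile ("csrr %0, hpmcounter3h" : "=r"(hi));
            __asm__ volatile ("csrr %0, hpmcounter3"  : "=r"(lo));
            __asm__ volatile ("csrr %0, hpmcounter3h" : "=r"(hi2));
        } while (hi != hi2);
        return ((unsigned long long)hi << 32) | lo;
    }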

Lastly, a feature I have enjoyed using in the past (on another ISA) is the concept of a 'marked' bit in the mstatus register. ... This is of course a bit intrusive in the architecture, as it requires adding a bit to mstatus, but the rest of the kernel just needs to save and restore this bit on context switches, without knowing its purpose.
Which architecture/OS are you referring to here? 

Through this discussion, we will learn which idea the community prefers: adding CSRs, overloading existing hpmevents, or some balanced compromise.  I believe the ultimate goal of this thread should be determining what RISC-V HPM should really look like.

Best,
Alan


Re: A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

Greg Favor
 

Bill,

Hopefully my last email also answers your question.

Greg


On Mon, Jul 20, 2020 at 8:04 PM Bill Huffman <huffman@...> wrote:

Brian,

I'm curious whether the 'marked' bit is a significant improvement on the mcountinhibit CSR, which could be swapped to enable counters for multiple processes.

      Bill

[snip: Brian's proposal and Alan's original posting, both reproduced in full earlier in this thread]


Re: A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

Greg Favor
 

Alan,

Hi. Comments below.

On Mon, Jul 20, 2020 at 7:39 PM alankao <alankao@...> wrote:
Hi Greg,

Questions:
- Is Active (bit[31]) any different from the inhibit register, functionally speaking?

On the surface it isn't, but in practice it is (or wants to be) for the following reasons:

- One wants the 'Active' state to be with all the other state of a counter so that it can all be context switched together by a hypervisor, as needed, when context switching a VM.  Having it (and all the other state bits) in mhpmevent means that they are context-switched "for free" when hpmcounter and mhpmevent are saved/restored.  Also, mixing all the 'Active' bits together in a common CSR (like mcountinhibit) complicates context-switching a subset of counters (since one has to explicitly insert and extract the relevant bits from that CSR).  (See the sketch after this list.)

- New/extra OpenSBI calls would be needed to support reading/writing such state that is in other places besides mhpmevent.

- When one brings into the picture setting and clearing the Active bit in response to hardware events (e.g. overflow by another counter or firing of a debug trigger in the Trigger Module), that can't be the current mcountinhibit bits (without changing the definition of that CSR).  In general, one can allow both hardware and software to control the activation and deactivation of active counting by a counter by setting/clearing one common bit that represents the 'active' state of the counter (and in a place that is naturally context-switched along with the rest of the counter state).  (Also, to be clear, this proposal isn't trying to standardize hardware control of Active bits, but it does provide a simple standardized basis for someone wanting to add in their own hardware counter control.)

- mcountinhibit is an M-mode-only CSR.  Any support for lower modes to directly enable/disable counting would require another/new CSR.
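A sketch of the context-switch point from the first bullet: with 'Active' living inside mhpmevent*, the ordinary save/restore of each (event, counter) pair carries the enable state along, and nothing has to be extracted from a shared CSR. CSR numbers must be immediates, so real code unrolls per counter; only counter 3 is shown:

    struct hpm_ctx { unsigned long evt3, cnt3; };

    static void hpm_save(struct hpm_ctx *c)
    {
        __asm__ volatile ("csrr %0, mhpmevent3"   : "=r"(c->evt3));
        __asm__ volatile ("csrr %0, mhpmcounter3" : "=r"(c->cnt3));
    }

    static void hpm_restore(const struct hpm_ctx *c)
    {
        __asm__ volatile ("csrw mhpmevent3,   %0" :: "r"(c->evt3));
        __asm__ volatile ("csrw mhpmcounter3, %0" :: "r"(c->cnt3));
    }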

- Assume that we are making this HPM an extension (maybe Zmhpm, Zshpm?).  How is it possible that no extra registers are needed together with the H extension?  At least we need the counteren.

The mcounteren, scounteren, and hcounteren CSR's already exist (between the base Privileged spec and the current H-extension draft).  Nothing additional is needed for this counter extension.
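(For instance, exposing hpmcounter3 down to S-mode and U-mode takes nothing beyond those existing CSRs; a sketch, run from M-mode firmware, using the spec's bit assignment where bit 3 gates hpmcounter3:)

    static inline void expose_hpmcounter3(void)
    {
        unsigned long bit3 = 1UL << 3;   /* bits 0-2 are CY, TM, IR */
        __asm__ volatile ("csrs mcounteren, %0" :: "r"(bit3)); /* M allows S */
        __asm__ volatile ("csrs scounteren, %0" :: "r"(bit3)); /* S allows U */
    }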

Greg
