I
have NOT been working on a RISC-V performance monitoring
proposal, but I've been involved with performance
monitoring, first as a user and then as an architect, for many
years and at several companies.
I would like to draw this group's attention to some
features of Intel x86 performance monitoring that turned
out pretty successful.
First, you're already talking about hardware
performance monitoring interrupts for statistical
profiling. Good. A few comments on that below.
---+ Performance event filtering and transformation logic
before counting
Next, I think one of the best bang-for-buck features of x86
EMON performance monitoring is the performance counter event
filtering. RISC-V has only the most primitive version of this.
Every x86 performance counter has per-counter event select
logic.
In addition to that logic, there is a mask that specifies
what modes to count in - User, OS, hypervisor. I see that
some of the RISC-V proposals also have that. Good.
But more filtering is also provided:
Each counter has a "Counter Mask" CMASK - really, a
threshold. When non-zero, this is compared to the count of
the selected event in any given cycle. If >= CMASK, the
counter is incremented by 1; if less, no increment.
=> This comparison allows a HW event to be used to
profile things like "Number of cycles in which 1, 2 or more,
3 or more ... events happened" - e.g. how often you are able
to achieve superscalar execution. In vectors, it might
count how many vector elements are masked in or out. If
you have events that correspond to buffer occupancy, you can
profile to see where the buffer is full or not.
INV - a bit that allows the CMASK comparison to be inverted.
    => so that you can count event >= threshold, or event
< threshold.
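To make this concrete, here is a minimal conceptual sketch in C of
the per-cycle filtering just described - a software model of the
idea, not hardware RTL, and not any vendor's actual logic:

    #include <stdbool.h>
    #include <stdint.h>

    /* Conceptual model of CMASK/INV filtering: each cycle the selected event
     * reports how many times it occurred (n). With CMASK == 0 the counter
     * accumulates raw events; otherwise it increments by 1 only in cycles
     * where the threshold test passes (INV inverts the test). */
    static inline uint64_t filtered_increment(uint64_t counter,
                                              unsigned n,      /* events this cycle */
                                              unsigned cmask,  /* threshold; 0 disables filtering */
                                              bool inv)        /* invert the comparison */
    {
        if (cmask == 0)
            return counter + n;
        bool pass = inv ? (n < cmask) : (n >= cmask);
        return counter + (pass ? 1u : 0u);
    }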
I would really have liked to have >, <, and ==
threshold. And also the ability to increment by 1 if
exceeding threshold, or by the actual count that exceeds the
threshold. The former allows you to find where you are
getting good superscalar behavior and where you are not; the
latter allows you to determine the average when exceeding the
threshold or not. When I defined this, I had to save hardware.
This masked comparison allows you to get more different
types of events, for events that occur more than once per
cycle. That's pretty good, but it doesn't help you with
scarce events, events that only occur once every four or
eight or N cycles. Later, Intel added what I call
"push-out" profiling: when the comparison condition is
met, e.g. when no instruction retires, a counter that
increments by one every clock cycle starts; when the condition
changes, the value of that counter is what is recorded,
naturally subject to all of the usual filtering. That was
too much hardware for me to add in 1991, but it proved very
useful.
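Here is a rough software model of the push-out idea; the function
names and the recording hook are made up for illustration, and in
real hardware the recorded value would be latched into a sample
register or delivered via the performance monitoring interrupt:

    #include <stdbool.h>
    #include <stdio.h>

    static unsigned pushout_cycles;   /* counts cycles while the condition holds */

    /* Hypothetical recording hook. */
    static void record_sample(unsigned run_length)
    {
        printf("push-out run of %u cycles\n", run_length);
    }

    /* Called once per "cycle" in this toy model. */
    static void pushout_tick(bool condition_met)   /* e.g. "no instruction retired" */
    {
        if (condition_met) {
            pushout_cycles++;                      /* run continues: keep counting */
        } else if (pushout_cycles != 0) {
            record_sample(pushout_cycles);         /* run ended: its length is the sample */
            pushout_cycles = 0;
        }
    }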
My first instinct is always to minimize the hardware cost of
performance monitoring. The nice thing about putting the
filtering logic at the performance counters is that it
removes the hardware cost from the individual units, like the
cache,
and leaves it in the performance monitoring unit. (The
question of centralized versus decentralized performance
counters is always an issue. Suffice it to say that Intel
P6 went for centralized performance counters, to save
hardware;
Pentium 4 went to a fully distributed performance counter
architecture, but users hated it, so Intel later returned to
the centralized model, at least architecturally, although
microarchitecture implementations might be decentralized.)
More filtering logic: each counter has an E, edge select, bit.
This counts when the condition described by the privilege
level mask and CMASK comparison changes. Using such edge
filtering, you can determine the average length of bursts,
e.g. the average length of a period that you have not been
able to execute any instruction, and so on. Simple filters
can give you average lengths and occupancies; fancier and
more expensive stuff is necessary to actually determine a
distribution.
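For example, with two counters programmed over the same interval on
the same condition - one with the edge bit set, one without - the
average burst length falls out directly. A small sketch, with
hypothetical counter names:

    #include <stdint.h>

    /* cond_cycles: cycles in which the condition held (CMASK/INV filtering, no E bit)
     * cond_edges : number of times the condition became true (same event, E bit set) */
    static double avg_burst_length(uint64_t cond_cycles, uint64_t cond_edges)
    {
        return cond_edges ? (double)cond_cycles / (double)cond_edges : 0.0;
    }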
Certain events themselves are sent to the performance
counters as bitmasks. E.g. the utilization of the execution
unit ports as a bitmask - on the original P6 {
ALU0/MUL/DIV/FMUL/FADD, ALU1/LEA, LD, STA, STD }, fancier on
modern machines. By controlling the UMASK field of the
filter control logic for each performance counter, you could
specify to count all instruction dispatches, or just loads,
and so on. Changing the UMASK field allowed you to profile
to find out which parts of the machine were being used and
which not. (This proved successful enough to get me, and
the software guy who eventually started using it, an
achievement award.)
If I were to do it over again I would have a generic POPCNT
as part of the filter logic, as well as the comparison.
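As a sketch of both the UMASK gating and the POPCNT wish, here is a
conceptual model with a made-up port bitmask; it is not any real
machine's encoding:

    #include <stdint.h>

    /* Hypothetical per-cycle "dispatch ports used" bitmask event, one bit per port. */
    #define PORT_ALU0  (1u << 0)
    #define PORT_ALU1  (1u << 1)
    #define PORT_LD    (1u << 2)
    #define PORT_STA   (1u << 3)
    #define PORT_STD   (1u << 4)

    /* Today: the UMASK selects which bits of the event may contribute,
     * and the counter sees whether any selected port fired this cycle. */
    static inline unsigned umask_hit(uint32_t event_bits, uint32_t umask)
    {
        return (event_bits & umask) != 0;
    }

    /* The "do it over again" version: a generic POPCNT in the filter logic,
     * counting how many of the selected ports fired this cycle. */
    static inline unsigned umask_popcnt(uint32_t event_bits, uint32_t umask)
    {
        return (unsigned)__builtin_popcount(event_bits & umask);
    }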
Finally, simple filter stuff:
INT - Performance interrupt enable
PC - pin control - this predated me: it toggled an external
pin when the performance counter overflowed. The very first
EMON event sampling took that external pin and wired it back
to the NMI pin of the CPU. Obviously, it is better to have
internal logic for performance monitoring interrupts.
Nevertheless, there is still a need for externally visible
performance event sampling, e.g. freeze performance events
outside the CPU, in the fabric, or in I/O devices. Exactly
what those are is obviously implementation dependent, but
it's still good to have a standard way of controlling such
implementation dependent features. I call this the "pin
architecture", and IMHO maintaining such system
compatibility was as much a factor in Intel's success as
instruction set architecture.
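For reference, the per-counter filter fields described above all
live in a single event select register on x86. Roughly, from
memory - treat the exact bit positions as approximate and check the
SDM:

    /* Approximate field layout of an x86 IA32_PERFEVTSELx register. */
    #include <stdint.h>

    #define EVTSEL_EVENT(e)  ((uint64_t)((e) & 0xff))        /* bits 7:0  event select  */
    #define EVTSEL_UMASK(u)  ((uint64_t)((u) & 0xff) << 8)   /* bits 15:8 unit mask     */
    #define EVTSEL_USR       (1ULL << 16)                    /* count in user mode      */
    #define EVTSEL_OS        (1ULL << 17)                    /* count in OS mode        */
    #define EVTSEL_EDGE      (1ULL << 18)                    /* E: edge detect          */
    #define EVTSEL_PC        (1ULL << 19)                    /* PC: pin control         */
    #define EVTSEL_INT       (1ULL << 20)                    /* interrupt on overflow   */
    #define EVTSEL_EN        (1ULL << 22)                    /* counter enable          */
    #define EVTSEL_INV       (1ULL << 23)                    /* invert CMASK comparison */
    #define EVTSEL_CMASK(c)  ((uint64_t)((c) & 0xff) << 24)  /* bits 31:24 threshold    */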
---+ Performance counter freeze
There are always several performance counters. At least two
per privilege level. At least a pair, so you can compute
things like cache miss rates and other ratios. But not
necessarily dedicated to any specific privilege level,
because that would be wasteful: you can study things a hell
of a lot more quickly if you can use all of the performance
counters, when other modes are not using them.
When you have several performance counters, you are often
measuring things together. You therefore need the ability to
freeze them all at the same time. This means that you need
to have all of the enable bits for all of the counters, or
at least a subset, in the same CSR. If you want, the
enable bit can be in both the per-counter control registers
and in a central CSR - i.e. there can be multiple views of
the same bit in different CSRs.
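For what it's worth, RISC-V already has a CSR in this spirit. A
minimal sketch, assuming M-mode access and the mcountinhibit CSR
from the privileged spec (bit i inhibits counter i; the mask value
a caller passes is up to them):

    #include <stdint.h>

    /* Freeze a group of hpmcounters with a single CSR write by setting their
     * inhibit bits; unfreeze by clearing them. Bit 0 is cycle, bit 2 is
     * instret, bits 3 and up are hpmcounter3 and following. */
    static inline void freeze_hpm_group(uint32_t mask)
    {
        __asm__ volatile("csrs mcountinhibit, %0" :: "r"(mask));
    }

    static inline void unfreeze_hpm_group(uint32_t mask)
    {
        __asm__ volatile("csrc mcountinhibit, %0" :: "r"(mask));
    }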
Performance analysts would really like the ability to freeze
the performance counters of multiple CPUs at the same time.
This is one motivation for that pin control signal.
---+ Precise performance monitoring interrupts - You can only
wish!
When you are doing performance counter event interrupt based
sampling, it would be really nice if the interrupt occurred
exactly at the instruction that had the event.
If you can do that, great. However, it takes extra hardware
to do that. Also, some events do not in any way correspond
to a retired instruction - think events that occur on
speculative instructions that never graduate/retire. Again,
you can create special registers that record, say, the
program counter of such a speculative instruction, but that
is again extra hardware.
IMHO there is zero chance that all implementations,
particularly the lowest cost implementations, will make all
performance events precise.
At the very least there should be a way of discovering
whether an event is precise or not.
Machine check architectures have the same issue.
---+ What should the priority of the hardware performance
monitoring interrupts be?
One of the best things I did for Intel was punt on this
issue: because I was also the architect in charge of the
APIC interrupt controller, I provided a special LVT
interrupt register just for the performance monitoring
interrupt.
This allowed the performance monitoring interrupt to use all
of the APIC features, all of those that made sense. For
example, the performance monitoring interrupt that Linux
uses is just the normal interrupt of some priority. But as
I mentioned above, the very first usage of performance
monitoring interrupts used NMI, and was therefore able to
profile code that had interrupts blocked. The interrupt
could also be directed to SMM, the not very good baby virtual
machine monitor, allowing performance monitoring to be done
independent of the operating system. Very nice, when you
don't have source code for the operating system. And so on.
I can't remember, but it is possible that the interrupt
could be directed to processors other than the local
processor. However, that would not subsume the externally
visible pin control, because the hardware pin can be a lot
less expensive and a lot more precise than signaling an
inter-processor interrupt.
I used a similar approach for machine check interrupts,
which could also be directed to the operating system, NMI,
SMM, hypervisor,…
By the way: I think Greg Favor said that x86's performance
monitoring interrupts are level sensitive. That is not
strictly speaking true: whether they are level sensitive or
not is programmed into the APIC local vector table. You can
make them level sensitive or edge triggered.
Obviously, however, when there are multiple performance
counters sharing the same interrupt, you need to know which
counter overflowed. Hence the sticky bits that Greg noticed
in the manuals.
---+ Fancier stuff
The above is mostly about getting the most out of simple
performance counters: providing filter logic so that you
can get the most insight out of the limited number of
events;
providing enables in a central place so that you can freeze
multiple counters at the same time; allowing the
performance counter interrupts to be directed not just at
different privilege levels but at different interrupt
priorities, including NMI, and possibly also to external
hardware.
There's a lot more stuff that can be done to help
performance monitoring. Unfortunately, I have always worked
at a place where I had to reduce the performance monitoring
hardware cost as much as possible. I am sure, however,
that many of you are familiar with fancier performance
monitoring features, such as
+ Histogram counting (allowing you to count distributions
without making multiple runs)
    => the CMASK comparators allow a very simple form of
this, assuming you have enough performance counters (a worked
example follows after this list). Actual histogram counters
can do this more cheaply.
+ Cycle attribution - defining performance events so that
you can actually say things like "X% of cycles are spent
waiting for memory".
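Here is a small worked example of the CMASK-based approach from the
list above, assuming four counters were programmed over the same
interval with thresholds 1 through 4 on the same event:

    #include <stdint.h>

    /* ge[i] holds "cycles with >= i+1 events" (counter programmed with CMASK = i+1).
     * Adjacent differences recover "cycles with exactly i+1 events"; the last
     * bucket is "4 or more". */
    static void histogram_from_thresholds(const uint64_t ge[4], uint64_t exactly[4])
    {
        for (int i = 0; i < 3; i++)
            exactly[i] = ge[i] - ge[i + 1];
        exactly[3] = ge[3];
    }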
IMHO the single most important "advanced" performance
monitoring feature is what I call "longitudinal
profiling". AMD Instruction Based Sampling (IBS), DEC
ProfileMe, ARM SPE (Statistical Profiling Extension). The
basic idea is to set a bit on some randomly selected
instruction package somewhere high up in the pipeline, e.g.
yet instruction fetch, and then let that bit flow down the
pipeline, sampling things as it goes. E.g. you might sample
the past missed latency or address, or whether it produced
a stall in interaction with a different marked instruction.
This sort of profiling is quite expensive, e.g. requiring a
bit in many places in the pipeline, as well as registers to
record the sample data, but it provides a lot of insight: it
can give you distributions and averages, it can tell you
what interactions between instructions are causing problems.
However, if RISC-V cannot yet afford to do longitudinal
profiling, the performance counter filter logic that I
described above is low hanging fruit, much cheaper.
From: Alankao
<alankao@...>
Sent: Monday, July 20, 2020 5:43PM
To: Tech-Privileged
<tech-privileged@...>
Subject: Re: [RISC-V] [tech-privileged] A proposal
to enhance RISC-V HPM (Hardware Performance Monitor)
On 7/20/2020 5:43 PM, alankao wrote:
It was just so poorly rendered in my mail
client, so please forgive my spam.
Hi Brian,
> I have been working on a similar proposal myself,
with overflow, interrupts, masking, and delegation. One of
the key differences in my proposal is that it unifies
> each counter's configuration control into a
per-counter register, by using mhpmevent* but with some
fields reserved/assigned a meaning. <elaborating>
Thanks for sharing your experience and the elaboration.
The overloading-hpmevent idea looks like the one in the
SBI PMU extension threads in Unix Platform Spec TG
by
Greg. I have a bunch of questions. How did your
proposal go later? Was it discussed in public? Did you manage
to implement your idea into a working HW/S-mode SW/U-mode
SW solution? If so, we can compete with each other by real
benchmarking the LoC of the perf patch (assuming you do it
on Linux) and the system overhead of running a long perf
sample.
> Another potential discussion point is, does overflow
happen at 0x7fffffffffffffff -> 0x8000000000000000, or
at 0xffffffffffffffff -> 0x0000000000000000? I have a
> bias towards the former so that even after overflow,
the count is wholly contained in an XLEN-wide register
treated as an unsigned number and accessible via
> a single read, which makes arithmetic convenient, but
I know some people prefer to (or are used to?) have the
overflow bit as a 33rd or 65th bit in a different
> register.
I have no bias here as long as the HPM interrupt can be
triggered. But somehow it seems to me that you assume the
HPM registers are XLEN wide, but actually they are not
(yet?). The spec says they should be 64 bits wide,
although obviously nobody implements nor remembers that.
> Lastly, a feature I have enjoyed using in the past
(on another ISA) is the concept of a 'marked' bit in the
mstatus register. ... This is of course a bit intrusive in
> the architecture, as it requires adding a bit to
mstatus, but the rest of the kernel just needs to save and
restore this bit on context switches, without knowing its
> purpose.
Which architecture/OS are you referring to here?
Through this discussion, we will understand which idea
the community prefers: adding CSRs, overloading existing
hpmevents, or some balanced compromise. I believe the
ultimate goal of this thread should be determining what
the RISC-V HPM should really be like.
Best,
Alan
I apologize for some of the language errors that occur
far too frequently in my email. I use speech recognition
much of the time, and far too often do not catch
misrecognition errors. This can be quite embarrassing,
amusing, and/or confusing. Typical errors are not spelling
but homonyms, words that sound the same - e.g. "cash"
instead of "cache".