I have been working on a similar proposal myself, with overflow, interrupts, masking, and delegation. One of the key differences in my proposal is that it unifies each counter's configuration control into a single per-counter register, by reusing mhpmevent* with some fields reserved or assigned a fixed meaning.
In your proposal, the control for counter 3, for example, is spread across mcounteren (interrupt enable), mcounterovf (overflow status), mcountermask (which modes to count), and of course mhpmevent3. In my proposal, all of these would live in the existing per-counter configuration register, mhpmevent*. Functionally there is not much difference between the proposals. The difference is in access cost: in your proposal, programming a counter may require a read-modify-write of 5 registers (those 4 control registers plus the counter itself), whereas in mine it requires plain writes (no read-modify-write) to only two registers. In practice, programming 4 counters would take 20 RMWs in your proposal (20 csrr's, 20 csrw's, and some glue ALU instructions in the general case, though some of those are avoidable) versus 8 simple writes in mine: four to set the event config and four to set or clear the counter registers. (I am a big fan of low-overhead perfmon interactions, to reduce the impact on the software under test.) With a shadow-CSR approach, both methods could even be supported on a single implementation, since it is really just a matter of where the bits are stored and how they are accessed, not one of functionality.
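To make the access-count comparison concrete, here is a rough sketch in C that tallies CSR reads and writes for both schemes against a simulated CSR file. Everything here is an assumption for illustration: the SIM_* indices are not real CSR addresses, the shared registers are modelled as one bit per counter, and all five accesses in the split scheme are treated as worst-case read-modify-writes, as in the count above.

```c
#include <assert.h>
#include <stdint.h>

/* Simulated CSR file. Indices are made up for this sketch, not real CSR addresses. */
enum {
    SIM_MCOUNTEREN   = 0,
    SIM_MCOUNTEROVF  = 1,
    SIM_MCOUNTERMASK = 2,
    SIM_MHPMEVENT0   = 8,   /* mhpmevent<n> lives at SIM_MHPMEVENT0 + n */
    SIM_MHPMCOUNTER0 = 40,  /* mhpmcounter<n> lives at SIM_MHPMCOUNTER0 + n */
    SIM_NUM_CSRS     = 72
};

struct access_count { unsigned reads, writes; };

static uint64_t sim_csr[SIM_NUM_CSRS];
static struct access_count tally;

static uint64_t csr_read(unsigned idx) {
    assert(idx < SIM_NUM_CSRS);
    tally.reads++;
    return sim_csr[idx];
}

static void csr_write(unsigned idx, uint64_t value) {
    assert(idx < SIM_NUM_CSRS);
    tally.writes++;
    sim_csr[idx] = value;
}

/* One read plus one write (the glue ALU work is not counted). */
static void csr_rmw(unsigned idx, uint64_t clear, uint64_t set) {
    uint64_t old = csr_read(idx);
    csr_write(idx, (old & ~clear) | set);
}

/* Split scheme: control bits for counter n are scattered over shared CSRs.
 * Worst case: every one of the five registers is read-modify-written. */
static void program_split(unsigned n, uint64_t event) {
    assert(n < 32);
    uint64_t bit = UINT64_C(1) << n;
    csr_rmw(SIM_MCOUNTEREN, 0, bit);               /* enable interrupt */
    csr_rmw(SIM_MCOUNTEROVF, bit, 0);              /* clear stale overflow */
    csr_rmw(SIM_MCOUNTERMASK, 0, bit);             /* select modes (1 bit assumed) */
    csr_rmw(SIM_MHPMEVENT0 + n, ~UINT64_C(0), event);
    csr_rmw(SIM_MHPMCOUNTER0 + n, ~UINT64_C(0), 0);
}

/* Unified scheme: event selector and control bits arrive in one value. */
static void program_unified(unsigned n, uint64_t config) {
    assert(n < 32);
    csr_write(SIM_MHPMEVENT0 + n, config);
    csr_write(SIM_MHPMCOUNTER0 + n, 0);
}

/* Program `count` counters starting at `first` and report the accesses used. */
static struct access_count measure(int unified, unsigned first, unsigned count) {
    tally.reads = tally.writes = 0;
    for (unsigned n = first; n < first + count; n++) {
        if (unified)
            program_unified(n, UINT64_C(0x100) + n);
        else
            program_split(n, UINT64_C(0x100) + n);
    }
    return tally;
}
```

Calling measure(0, 3, 4) models programming counters 3 through 6 the split way and yields 20 reads and 20 writes; measure(1, 3, 4) yields 0 reads and 8 writes.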
By my count, roughly 8 bits would be needed in mhpmevent* to specify interrupt enable, mode masking, and so on. These could be allocated from the top bits, leaving 32+ bits for ordinary event specification.
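As an illustration of how such a layout might look, here is a minimal sketch in C that reserves the top 8 bits of a 64-bit mhpmevent value for control. The specific bit positions and the HPM_* names are hypothetical; the proposal above only says that roughly 8 top bits would be used, not which bit does what.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical field positions, chosen only for illustration. */
#define HPM_OVF        (UINT64_C(1) << 63)    /* overflow status */
#define HPM_IRQ_EN     (UINT64_C(1) << 62)    /* interrupt on overflow */
#define HPM_COUNT_M    (UINT64_C(1) << 61)    /* count in M-mode */
#define HPM_COUNT_S    (UINT64_C(1) << 60)    /* count in S-mode */
#define HPM_COUNT_U    (UINT64_C(1) << 59)    /* count in U-mode */
#define HPM_CTRL_MASK  (UINT64_C(0xff) << 56) /* top 8 bits: control */
#define HPM_EVENT_MASK (~HPM_CTRL_MASK)       /* low 56 bits: event selector */

/* Combine an event selector and control flags into one register value. */
static uint64_t hpm_pack(uint64_t event, uint64_t ctrl) {
    assert((event & HPM_CTRL_MASK) == 0);  /* event must not spill into control */
    assert((ctrl & HPM_EVENT_MASK) == 0);  /* control must stay in the top bits */
    return event | ctrl;
}

/* Extract the event selector from a packed value. */
static uint64_t hpm_event(uint64_t config) {
    return config & HPM_EVENT_MASK;
}

/* Extract the control flags from a packed value. */
static uint64_t hpm_ctrl(uint64_t config) {
    return config & HPM_CTRL_MASK;
}
```

With this layout, hpm_pack(0x1234, HPM_IRQ_EN | HPM_COUNT_U) produces 0x4800000000001234, which can go to mhpmevent* in a single write with no prior read.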
Another potential discussion point: does overflow happen at 0x7fffffffffffffff -> 0x8000000000000000, or at 0xffffffffffffffff -> 0x0000000000000000? I have a bias towards the former, so that even after overflow the count is wholly contained in an XLEN-wide register treated as an unsigned number and accessible via a single read, which makes arithmetic convenient. I know some people prefer (or are used to?) having the overflow bit as a 33rd or 65th bit in a different register.
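The arithmetic convenience of overflowing at the most significant bit can be sketched as follows, assuming XLEN = 64. The helper names are made up for this example. To sample every N events, the counter is preloaded with 2^63 - N; overflow is then simply "bit 63 is set", and the counter keeps a usable unsigned value while the interrupt is pending.

```c
#include <assert.h>
#include <stdint.h>

#define HPM_MSB (UINT64_C(1) << 63)  /* overflow threshold, XLEN = 64 assumed */

/* Value to preload so that the MSB becomes set after `period` events. */
static uint64_t preload_for(uint64_t period) {
    assert(period > 0 && period <= HPM_MSB);
    return HPM_MSB - period;
}

/* Overflow has happened once the counter's top bit is set. */
static int overflowed(uint64_t counter) {
    return (counter & HPM_MSB) != 0;
}

/* Events counted since the preload, from one read of the counter.
 * Valid before and after overflow, because nothing has wrapped. */
static uint64_t events_since(uint64_t preload, uint64_t now) {
    return now - preload;
}
```

For example, with a period of 1000 the preload is 0x7ffffffffffffc18. If the handler reads the counter 500 events late, events_since still returns 1500 from that single read, with no carry bit to fetch from a second register.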
Lastly, a feature I have enjoyed using in the past (on another ISA) is the concept of a 'marked' bit in the mstatus register. If one allows masked counting based on whether this bit is set or clear, it allows additional filtering on when to count. For example, in a system with multiple processes, one set of processes could have the marked bit set, and the remaining processes could keep the marked bit clear. One can then count "total instructions in marked processes" and "total instructions in unmarked processes" in a single run on two different counters. In complicated systems, this can be enormously helpful. This is of course a bit intrusive in the architecture, as it requires adding a bit to mstatus, but the rest of the kernel just needs to save and restore this bit on context switches, without knowing its purpose.
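A toy model of marked-bit filtering, to show the idea rather than any real hardware behaviour: each counter carries a filter (always, marked only, unmarked only), and the currently running process's marked bit decides whether retired instructions are added. The type names, the filter encoding, and the demo schedule with its instruction counts are all invented for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-counter filter on the 'marked' bit. */
enum mark_filter { COUNT_ALWAYS, COUNT_IF_MARKED, COUNT_IF_UNMARKED };

/* One scheduling slice: the marked bit restored for the running process,
 * and how many instructions it retired before the next context switch. */
struct slice {
    bool marked;
    uint64_t instructions;
};

/* Decide whether a counter with this filter counts in the current state. */
static bool filter_passes(enum mark_filter filter, bool marked) {
    switch (filter) {
    case COUNT_IF_MARKED:   return marked;
    case COUNT_IF_UNMARKED: return !marked;
    default:                return true;
    }
}

/* Total a counter with the given filter would reach over a schedule. */
static uint64_t count_schedule(enum mark_filter filter,
                               const struct slice *slices, size_t n) {
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        if (filter_passes(filter, slices[i].marked))
            total += slices[i].instructions;
    }
    return total;
}

/* Made-up schedule: marked and unmarked processes alternating. */
static const struct slice demo_schedule[] = {
    { true, 100 }, { false, 40 }, { true, 25 }, { false, 60 },
};
static const size_t demo_len = sizeof demo_schedule / sizeof demo_schedule[0];
```

Over the demo schedule, a counter filtered on marked reaches 125 and one filtered on unmarked reaches 100, both from the same run, and together they account for all 225 instructions.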
As you point out, interrupts and overflows are the key show-stoppers with the current performance monitor architecture, and those are the most important to get added.