Re: [PATCH 1/1] RAS features for OS-A platform server extension


Greg Favor
 

On Thu, Jun 17, 2021 at 11:13 AM Allen Baum <allen.baum@...> wrote:
Is it acceptable to everyone that all single bit errors on all caches must be correctable?

Nowadays single-bit errors are far from rare.  There will always be people that run Linux and are willing to accept occasional silent corruptions and whatever mysterious application/data corruptions occur as a result.  But for a standardized server-class platform spec, this is a rather low "table stakes" bar to set.  Virtually no customer of a "server-class" platform will be comfortable without that (especially since the x86 and ARM alternatives provide at least that).
 
That really affects designs in fundamental ways for L1 caches (as opposed to simply detecting).

Parity (plus invalidate on error detection) suffices for I caches and WT D caches; ECC is used on WB D caches - even WB L1 D caches (which is one argument for doing a WT L1 D cache with parity, but the majority of people still do WB L1 D caches with ECC).

Understandably some people don't want to deal with ECC on a WB DL1, and parity or nothing may be fine for less-than server-class systems.
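To illustrate the detect-versus-correct distinction behind that WT/WB split, here is a minimal sketch in Python. Hamming(7,4) stands in for the wider SECDED codes real caches actually use, and the function names are mine, not from any spec: parity can only flag a flipped bit (fine for a clean WT line, which can just be invalidated and refetched), while a Hamming syndrome names the flipped position so a dirty WB line can be repaired in place.

```python
def parity(bits):
    # Even parity over a bit list: detects any single-bit flip,
    # but cannot locate it, so the data cannot be repaired.
    return sum(bits) % 2

def hamming74_encode(d):
    # d = [d1, d2, d3, d4] -> 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]
    d1, d2, d3, d4 = d
    p1 = (d1 + d2 + d4) % 2   # covers 1-indexed positions 1,3,5,7
    p2 = (d1 + d3 + d4) % 2   # covers positions 2,3,6,7
    p3 = (d2 + d3 + d4) % 2   # covers positions 4,5,6,7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(cw):
    # Recompute the three parity checks; the syndrome is the 1-indexed
    # position of a single flipped bit (0 means no error detected).
    s1 = (cw[0] + cw[2] + cw[4] + cw[6]) % 2
    s2 = (cw[1] + cw[2] + cw[5] + cw[6]) % 2
    s3 = (cw[3] + cw[4] + cw[5] + cw[6]) % 2
    syndrome = s1 + 2 * s2 + 4 * s3
    fixed = cw[:]
    if syndrome:
        fixed[syndrome - 1] ^= 1
    return [fixed[2], fixed[4], fixed[5], fixed[6]]   # data bits only
```

A single flipped codeword bit is located and repaired; with bare parity the same flip is only noticed, which is exactly why parity alone is acceptable only when a clean copy exists elsewhere.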
 
Not as big a concern for L2 and above.
Speaking from my Intel experience, the rule was expressed as failures per year - and if an L1 cache was small enough that it stayed within that budget, then it didn't need correction.

Somewhat analogously, TSMC imposes similarly expressed requirements wrt having redundancy in all the RAMs.  Even just one non-redundant 64 KiB cache can pretty much use up the budget allowed to go without redundancy.
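To make the failures-per-year framing concrete, here is a back-of-the-envelope sketch. The FIT-per-Mbit figure is an illustrative assumption (real SRAM soft-error rates vary widely with process node and altitude), not an Intel or TSMC number:

```python
# Back-of-the-envelope soft-error arithmetic for one unprotected cache.
FIT_PER_MBIT = 1000     # ASSUMED raw SRAM rate: failures per 1e9 hours, per Mbit
HOURS_PER_YEAR = 8760

cache_kib = 64
cache_mbit = cache_kib * 1024 * 8 / 1e6   # 64 KiB ~= 0.52 Mbit

fit = FIT_PER_MBIT * cache_mbit           # ~524 FIT for this one array
errors_per_year = fit * 1e-9 * HOURS_PER_YEAR

# One device sees an upset only every couple of centuries...
print(f"per device: {errors_per_year:.4f} errors/year")
# ...but a 100k-server fleet sees hundreds per year, which is why
# "rare per core" stops meaning "rare" at server scale.
print(f"per 100k-server fleet: {errors_per_year * 100_000:.0f} errors/year")
```

Under these assumed numbers a single core looks harmless, yet one small array already consumes a few-failures-per-thousand-machine-years budget by itself - the same shape of argument as the redundancy-budget rule above.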

In any case, the Base platform spec should allow people to make whatever choice they want (and live with the consequences).  But to be competitive and to meet customer expectations (especially in a multi-core world), the Server spec needs to require a higher-than-nothing bar.
 
So, it might be useful to have a measurement baseline like that, rather than an absolute requirement.

A functional requirement is simple to specify and aligns with standard industry practices.  The alternatives get more involved and in practice won't provide much of any value over the functional requirement (for server-class systems).

The argument is: why are you requiring ECC correction on this - and not on the register file, or CSRs?

This is a baseline requirement - aligned with common/dominant industry practice.  Conversely it is not a dominant industry practice to protect flop-based register files (or flop-based storage structures in general).  (Latch-based register files, depending on whether the bitcell is more SRAM-like or flop-like, fall in one category or the other.)

The reason is they're small enough that failures are unlikely - and that's how your rationale should be stated.

Nowadays even the aggregate error rate or MTBF due to flop soft errors is not small.  But thankfully for most designs that MTBF component is acceptable within typical MTBF budgets.

As far as instead specifying an MTBF requirement, one then gets into system-wide issues and overall MTBF budgets, where it gets spent, what about the technology dependence of all this, and ....  Plus that effectively would provide little guidance to CPU designers as to what is their individual MTBF budget.  Or, conversely, one can probably have long discussions/arguments about what is the right MTBF number to require at the level of a single CPU core.

But at the end of the day, virtually no customer of a server-class system is going to accept a product that doesn't even have single-bit error protection on the cache hierarchy.

Greg
