Re: [PATCH 1/1] RAS features for OS-A platform server extension
On Thu, Jun 17, 2021 at 11:13 AM Allen Baum <allen.baum@...> wrote:
Nowadays single-bit errors are far from rare. There will always be people who run Linux and are willing to accept occasional silent corruptions and whatever mysterious application/data corruptions occur as a result. But for a standardized server-class platform spec, single-bit error protection is a rather low "table stakes" bar to set. Virtually no customer of a "server-class" platform will be comfortable without it (especially since the x86 and ARM alternatives provide at least that).
Parity (with invalidate on error detection) suffices for I caches and WT D caches, since those never hold the only copy of a line; ECC is used on WB D caches, even L1 D caches. (That is one argument for doing a WT L1 D cache with parity, but the majority of people still do WB L1 D caches with ECC.)
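To make the parity-plus-invalidate point concrete, here is a minimal toy sketch in Python. The class `WriteThroughLine`, its fields, and the dict used as the "next level" are all hypothetical names for illustration, not anything from the spec or a real design. It only shows why detection alone is enough when the line is always clean.

```python
def parity(word):
    """Even-parity bit over the bits of word (1 if the popcount is odd)."""
    return bin(word).count("1") & 1


class WriteThroughLine:
    """Toy one-word write-through cache line protected by a parity bit.

    Hypothetical model: `backing` is a dict standing in for the next
    level of the hierarchy, which is assumed to be protected itself.
    """

    def __init__(self, backing, addr):
        self.backing = backing
        self.addr = addr
        self.valid = False
        self.data = 0
        self.par = 0

    def fill(self):
        # (Re)fetch the word from the next level and recompute parity.
        self.data = self.backing[self.addr]
        self.par = parity(self.data)
        self.valid = True

    def write(self, value):
        # Write-through: the next level is updated on every store,
        # so the cached copy is never the only copy of the data.
        self.backing[self.addr] = value
        self.data = value
        self.par = parity(value)
        self.valid = True

    def read(self):
        if self.valid and parity(self.data) != self.par:
            # Error detected: just drop the line, it is clean by construction.
            self.valid = False
        if not self.valid:
            self.fill()
        return self.data


backing = {0x40: 0xABCD}
line = WriteThroughLine(backing, 0x40)
line.read()            # miss, fills from backing
line.data ^= 1 << 3    # inject a single-bit upset into the cached copy
recovered = line.read()  # parity mismatch, invalidate, refetch
```

A write-back line cannot recover this way once it is dirty, because the flipped copy may be the only copy; that is where correction (ECC) rather than detection is needed.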
Understandably some people don't want to deal with ECC on a WB DL1, and parity or nothing may be fine for less-than server-class systems.
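For a sense of what ECC on a WB D cache involves functionally, here is a small SECDED sketch (extended Hamming code: single-error correct, double-error detect) in Python. The function names and the 8-bit default width are assumptions chosen for illustration; real caches typically protect wider words, and nothing here is meant as a specific implementation.

```python
def secded_encode(data, k=8):
    """Encode k data bits as an extended Hamming codeword (list of bits).

    code[0] is the overall parity bit; code[1..n] use the classic layout
    with check bits at power-of-two positions.
    """
    r = 0
    while (1 << r) < k + r + 1:
        r += 1
    n = k + r
    code = [0] * (n + 1)
    bit = 0
    for pos in range(1, n + 1):
        if pos & (pos - 1):            # not a power of two: data position
            code[pos] = (data >> bit) & 1
            bit += 1
    for i in range(r):
        p = 1 << i                     # check bit covering positions with bit i set
        code[p] = sum(code[pos] for pos in range(1, n + 1) if pos & p) & 1
    code[0] = sum(code[1:]) & 1        # overall parity enables double-error detection
    return code


def secded_decode(code):
    """Return (data, status), status in {'ok', 'corrected', 'uncorrectable'}."""
    code = list(code)
    n = len(code) - 1
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos            # XOR of set positions is 0 for a valid word
    odd = sum(code) & 1
    if syndrome == 0 and not odd:
        status = "ok"
    elif odd and syndrome <= n:
        if syndrome:                   # syndrome 0 means the overall parity bit flipped
            code[syndrome] ^= 1
        status = "corrected"
    else:
        status = "uncorrectable"       # even number of flips with nonzero syndrome
    data, bit = 0, 0
    for pos in range(1, n + 1):
        if pos & (pos - 1):
            data |= code[pos] << bit
            bit += 1
    return data, status


word = secded_encode(0xA5)
single = list(word)
single[5] ^= 1                         # one upset: corrected in place
double = list(word)
double[5] ^= 1
double[9] ^= 1                         # two upsets: detected, not corrected
results = (secded_decode(word), secded_decode(single), secded_decode(double)[1])
```

The point of the sketch is only that a dirty line with a single flipped bit can still be read back correctly, which parity alone cannot offer.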
Somewhat analogously, TSMC imposes similarly expressed requirements with respect to having redundancy in all the RAMs. Even just one non-redundant 64 KiB cache can pretty much use up what is allowed to not have redundancy.
In any case, the Base platform spec should allow people to make whatever choice they want (and live with the consequences). But to be competitive and to meet customer expectations (especially in a multi-core world), the Server spec needs to require a higher-than-nothing bar.
A functional requirement is simple to specify and aligns with standard industry practices. The alternatives get more involved and in practice won't provide much of any value over the functional requirement (for server-class systems).
This is a baseline requirement - aligned with common/dominant industry practice. Conversely it is not a dominant industry practice to protect flop-based register files (or flop-based storage structures in general). (Latch-based register files, depending on whether the bitcell is more SRAM-like or flop-like, fall in one category or the other.)
Nowadays even the aggregate error rate or MTBF due to flop soft errors is not small. But thankfully for most designs that MTBF component is acceptable within typical MTBF budgets.
As for instead specifying an MTBF requirement, one then gets into system-wide issues: overall MTBF budgets, where the budget gets spent, the technology dependence of all this, and .... Plus, that would effectively provide little guidance to CPU designers as to what their individual MTBF budget is. Or, conversely, one can probably have long discussions/arguments about what the right MTBF number is to require at the level of a single CPU core.
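A rough sketch of why a system-level MTBF number says little to an individual core designer: under the usual simplifying assumption of independent, constant failure rates, FIT contributions (failures per 10^9 device-hours) add up, so a per-core budget falls out only after someone fixes the system target, the share given to the CPUs, and the core count. All three inputs below are hypothetical placeholders, not numbers from any spec or product.

```python
FIT_BASE_HOURS = 1e9  # 1 FIT = one failure per 10^9 device-hours


def per_core_fit_budget(system_mtbf_hours, cpu_share, num_cores):
    """Per-core FIT budget implied by a system-level MTBF target.

    Assumes independent, constant failure rates so that FITs simply add.
    `cpu_share` is the (hypothetical) fraction of the system budget
    allotted to the CPU cores as a group.
    """
    system_fit = FIT_BASE_HOURS / system_mtbf_hours
    return system_fit * cpu_share / num_cores


# Same system target and CPU share, different core counts (placeholder values).
budget_10_cores = per_core_fit_budget(1e6, 0.5, 10)
budget_64_cores = per_core_fit_budget(1e6, 0.5, 64)
```

The same system target gives very different per-core numbers as the core count or the CPU share changes, which is the kind of system-wide negotiation a functional requirement sidesteps.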
But at the end of the day, virtually no customer of a server-class system is going to accept a product that doesn't even have single-bit error protection on the cache hierarchy.