Re: Platform specification questions

Ved Shanbhogue

On Mon, Dec 13, 2021 at 02:00:38PM -0800, Greg Favor wrote:
On Mon, Dec 13, 2021 at 11:06 AM Vedvyas Shanbhogue <ved@...>

The current wording is the following.
All cache structures must be protected.
single-bit errors must be detected and corrected.
multi-bit errors can be detected and reported.
Platforms are free to implement more advanced features than the
minimalistic requirements that are mandated here. So we should be OK.
Could I suggest:
"Cache structures must be protected to address the Failure-in-time (FIT)
requirements. The protection mechanisms may include single-bit/multi-bit
error detection and/or single/multi-bit error detection/correction schemes,
replaying faulting instructions, lock-step execution, etc."
This seems like a toothless and qualitative mandate since no FIT
requirements are specified. It can be a suggestion, although it's just a
qualitative suggestion. It's essentially just saying "don't forget to
consider FIT requirements". One can imagine a hundred such reminders that
factor into high-end silicon design. Why highlight just this one?

The reference to "cache structures" is also incomplete - as well as
ambiguous as to whether it refers just to caches (in the most popular sense
of the word) or also to other caching structures like TLBs. Almost
all RAM-based structures in which an error can result in functional
failure need to be protected. Although one can take the view that the
above text was just trying to express a minimum requirement that doesn't
encompass all RAM-based structures. My suggestion would be something like
the following two statements:
Totally agree that the term "cache structure" is ambiguous and that a variety of caches may be built. How caches are built should also be transparent to the ISA, software, and the platform in general. Like you said, reliability engineering is not something that affects software compatibility or hardware/software contracts. And as you rightly pointed out, caches are the most obvious, but a reliable system will need more, such as the right thermal engineering, stable clock/voltage delivery, the right ageing guardbands, use of Gray codes when appropriate, voltage monitors, timing margin sensors, protection on data/control buses, protection on register files, protection on internal data paths, etc.

I would be totally okay with dropping this whole paragraph.

Mandate: *At a minimum, caching structures must be protected such that
single-bit errors are detected and corrected by hardware.*
Would a mandate be overreaching, and why limit it to caches then?

A product may define its reliability goals and may reason that a certain cache need not be protected for various reasons, such as the technology in which the product is built, the altitude at which it is supposed to be used, the architectural vulnerability factor computed for that structure, etc.

I fail to understand how mandating the way a platform implements reliability would add to or take away from the OS-A platform compatibility goal, which is to be able to boot a shrink-wrapped server operating system.
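
As an aside for readers less familiar with the terms in the wording quoted above: "single-bit errors must be detected and corrected, multi-bit errors can be detected and reported" describes roughly what a SECDED (single-error-correct, double-error-detect) code provides. The following is only a minimal sketch of that idea, assuming an extended Hamming(8,4) code over a 4-bit value. The function names `encode` and `decode` and the status strings are illustrative assumptions, not taken from the platform specification, and real cache arrays typically use wider codes implemented in hardware.

```python
def encode(nibble: int) -> list[int]:
    """Encode a 4-bit value into an 8-bit extended Hamming codeword.

    Layout (hypothetical, chosen for this sketch): index 0 holds the overall
    parity bit, indices 1, 2, 4 hold Hamming parity bits, and indices
    3, 5, 6, 7 hold the data bits (least significant first).
    """
    data = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = data
    # Parity bit p covers every position whose index has bit p set.
    for p in (1, 2, 4):
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    # Overall parity makes the total number of set bits even.
    code[0] = sum(code[1:]) % 2
    return code


def decode(code: list[int]) -> tuple[int, str]:
    """Return (value, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    # The syndrome is the XOR of the indices of all set bits; it is zero
    # for a valid codeword and equals the flipped index after one flip.
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i
    overall_odd = sum(code) % 2 == 1

    fixed = list(code)
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # Odd number of flips: treat as a single-bit error and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        # This can be detected and reported but not corrected.
        status = "uncorrectable"

    value = fixed[3] | fixed[5] << 1 | fixed[6] << 2 | fixed[7] << 3
    return value, status
```

In this sketch a single flipped bit anywhere in the codeword is repaired silently, while two flipped bits are only reported; in the uncorrectable case the returned value should not be trusted. Whether a given structure needs this level of protection at all is exactly the product-level judgment discussed above.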

