On Mon, Dec 13, 2021 at 05:11:38PM -0800, Greg Favor wrote:
I think this whole RAS-related topic in the current platform draft was to
establish some form of modest RAS requirement (versus no requirement) until
a proper RAS arch spec exists. Although even then (assuming that arch spec
is like x86 and ARM RAS specs that are just concerned with standardizing
RAS registers for logging and the mechanisms for reporting errors), there
still won't be any minimum requirement for actual error detection and
correction.
I agree. I think the RAS ISA should standardize error logging and reporting, but not mandate which errors are detected or corrected, or how they are corrected or contained. For example, even in the x86 and ARM space there are many product segments with varying degrees of resilience, yet the RAS architecture flexibly covers the full spectrum of implementations across multiple x86 and ARM vendors.
Fundamentally, should the Server platform spec mandate ANY error
detection/correction requirements, or just leave it as a wild west among
hardware developers to individually and eventually figure out where the
line exists as far as the basic needs for RAS in *Server*-compliant
platforms? And leave it for system integrators to discover that some
Server-compliant hardware has less than "basic" RAS?
This was one of the sources of my questions. If the platform specification's intent is to specify the SEE, ISA and non-ISA hardware (the hardware/software contract) as visible to software, so that a shrink-wrapped operating system can load, then I would say it's not the platform specification's role to teach how to design resilient hardware. If the goal of the platform specification is to teach hardware designers how to design resilient hardware, then I think the specification falls short in many ways. I think you also hit upon that in the next statement.
BUT if the platform spec is ONLY trying to establish hardware/software
interoperability, and not also match up hardware and software expectations
regarding other areas of functionality such as RAS, then that answers the
question. My own leaning is towards trying to address the latter versus
the narrower view that the only concern is software interoperability. But
I understand the arguments both ways.
My understanding was the former, i.e. establishing the standard for hardware-software interoperability. Specifically for RAS, I think the areas where interoperability is required (e.g. standardized logging/reporting, redirecting reporting to firmware-first, etc.) should be in the purview. Requirements like "every cache must have single bit error correction" or "must implement SECDED-ECC" may not be necessary to achieve this objective. For example, an implementation may have two levels of caches where instructions may be cached, and for the lowest level it may implement only parity, but on an error refetch from a higher-level cache or DDR where there might be ECC. Requiring ECC in the instruction cache of such an implementation seems unnecessary: the machine is meeting its FIT rate objectives through other means.
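
To make the parity-plus-refetch example concrete, here is a toy Python sketch. It is only an illustration under stated assumptions, not a description of any real design: the names ParityICache, fetch, flip_bit and refetches are made up, the cache holds one 32-bit word per entry, and the next level (a plain dict here) is assumed to be ECC-protected and always correct. The point is that instruction lines are never dirty, so a detected error can be recovered by dropping the entry and fetching it again.

```python
def parity(word: int) -> int:
    """Return 1 if the 32-bit word has an odd number of set bits, else 0."""
    return bin(word & 0xFFFFFFFF).count("1") & 1


class ParityICache:
    """Toy instruction cache with one parity bit per word (detect only)."""

    def __init__(self, backing):
        self.backing = backing   # next level; assumed ECC-protected and correct
        self.lines = {}          # addr -> (word, stored_parity)
        self.refetches = 0       # parity errors recovered by refetching

    def _fill(self, addr):
        word = self.backing[addr]
        self.lines[addr] = (word, parity(word))
        return word

    def fetch(self, addr):
        if addr not in self.lines:
            return self._fill(addr)
        word, stored = self.lines[addr]
        if parity(word) != stored:
            # Instructions are never dirty, so drop the entry and refetch it.
            self.refetches += 1
            return self._fill(addr)
        return word

    def flip_bit(self, addr, bit):
        """Inject a single-bit upset into a cached word (test hook only)."""
        word, stored = self.lines[addr]
        self.lines[addr] = (word ^ (1 << bit), stored)


memory = {0x1000: 0x00000013}      # RISC-V NOP (addi x0, x0, 0)
cache = ParityICache(memory)
cache.fetch(0x1000)                # miss: fill from the next level
cache.flip_bit(0x1000, 5)          # simulate a single-bit upset in the cache
assert cache.fetch(0x1000) == 0x00000013   # parity mismatch, refetched
assert cache.refetches == 1
```

Single-bit parity only detects an odd number of flipped bits, so a real design would still have to account for multi-bit upsets when working out its FIT rate.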
regards
ved