On Mon, Dec 13, 2021 at 05:11:38PM -0800, Greg Favor wrote:
> I think this whole RAS-related topic in the current platform draft was to

I agree. I think the RAS ISA would want to be about standardized error logging and reporting, rather than mandating which errors are detected or corrected and how they are corrected or contained. For example, even in the x86 and ARM space there are many product segments with varying degrees of resilience, yet the RAS architecture flexibly covers the full spectrum of implementations across multiple x86 and ARM vendors.
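To make the distinction concrete, here is a minimal sketch, assuming a hypothetical standardized error record (the names `Severity`, `ErrorRecord`, `os_action` and the severity classes are illustrative, not taken from any ratified RISC-V spec). Software's handling depends only on the record's contents, so a parity-plus-refetch cache and a SECDED cache that both report a corrected error look identical to the OS.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Severity(Enum):
    # Hypothetical severity classes, chosen for illustration only.
    CORRECTED = 1
    UNCORRECTED_RECOVERABLE = 2
    UNCORRECTED_FATAL = 3


@dataclass(frozen=True)
class ErrorRecord:
    # Hypothetical standardized record: this layout is the hardware/software
    # contract. Nothing in it says HOW the error was detected or corrected.
    source: str                   # reporting unit name (illustrative)
    severity: Severity
    address: Optional[int] = None


def os_action(record: ErrorRecord, firmware_first: bool = False) -> str:
    """Return the action generic software would take for a record (hypothetical policy)."""
    if firmware_first:
        # Hypothetical platform policy: reports go to firmware before the OS.
        return "route-to-firmware"
    if record.severity is Severity.CORRECTED:
        return "log"
    if record.severity is Severity.UNCORRECTED_RECOVERABLE:
        return "kill-affected-process"
    return "panic"
```

For example, `os_action(ErrorRecord("icache-parity-refetch", Severity.CORRECTED))` and `os_action(ErrorRecord("l2-secded", Severity.CORRECTED))` both return `"log"`, even though the two units use different protection schemes.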
> Fundamentally, should the Server platform spec mandate ANY error

This was one of the sources of my questions. If the platform specification's intent is to specify the SEE, ISA, and non-ISA hardware (the hardware/software contract) as visible to software, so that a shrink-wrapped operating system can load, then I would say it is not the platform specification's role to teach how to design resilient hardware. If the goal of the platform specification is to teach hardware designers how to design resilient hardware, then I think the specification falls short in many ways... I think you also hit upon that in your next statement.
> BUT if the platform spec is ONLY trying to establish hardware/software

My understanding was the former, i.e. establishing the standard for hardware/software interoperability. Specifically for RAS, I think the areas where interoperability is required, e.g. standardized logging/reporting and redirecting reporting to firmware-first, should be in the purview. Requirements like "every cache must have single-bit error correction" or "must implement SECDED ECC" may not be necessary to achieve this objective. For example, an implementation may have two levels of caches where instructions may be cached; the lowest level may implement only parity, and on an error it refetches from a higher-level cache or DDR where there might be ECC. Requiring ECC in such an implementation's instruction cache seems unnecessary, since the machine is meeting its FIT rate objectives through other means.
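The parity-plus-refetch scheme described above can be sketched as a toy model, assuming the backing store (higher-level cache or DDR) holds a correct copy and that instruction lines are never dirty, so invalidating and refetching a line is always safe. `ParityICache`, `inject_bit_flip`, and the one-word-per-line layout are hypothetical simplifications for illustration, not a description of any real design.

```python
from typing import Dict, List


def parity(word: int) -> int:
    # Even-parity bit over a non-negative integer word.
    return bin(word).count("1") & 1


class ParityICache:
    """Toy instruction cache storing one parity bit per line (hypothetical).

    Assumes the backing store is ECC-protected (or otherwise trusted) and
    instruction lines are clean, so a parity error is recovered by refetch.
    Note: parity detects only an odd number of flipped bits.
    """

    def __init__(self, backing: Dict[int, int]):
        self.backing = backing
        self.lines: Dict[int, List[int]] = {}  # addr -> [data, parity bit]
        self.corrected = 0                     # errors recovered via refetch

    def _fill(self, addr: int) -> None:
        data = self.backing[addr]
        self.lines[addr] = [data, parity(data)]

    def fetch(self, addr: int) -> int:
        if addr not in self.lines:
            self._fill(addr)
        data, p = self.lines[addr]
        if parity(data) != p:
            # Error detected (not corrected) locally: discard and refetch.
            self._fill(addr)
            self.corrected += 1
            data = self.lines[addr][0]
        return data

    def inject_bit_flip(self, addr: int, bit: int) -> None:
        # Test hook: flip one stored data bit without updating its parity.
        self.lines[addr][0] ^= 1 << bit
```

Usage: with `ic = ParityICache({0x100: 0x13})`, calling `ic.fetch(0x100)`, then `ic.inject_bit_flip(0x100, 5)`, then `ic.fetch(0x100)` again returns the clean value `0x13` and increments `ic.corrected` to 1. From software's point of view the error was corrected, even though the L1 itself only has parity.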