Re: [PATCH 1/1] RAS features for OS-A platform server extension

Allen Baum

Is it acceptable to everyone that all single-bit errors on all caches must be correctable?
For L1 caches, requiring correction (as opposed to simply detection) affects designs in fundamental ways.
It is not as big a concern for L2 and above.
Speaking from my Intel experience, the rule was expressed as failures per year - and if an L1 cache was small enough to meet that target, then it didn't need correction.
So, it might be useful to have a measurement baseline like that, rather than an absolute requirement.
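
To make the failures-per-year idea concrete, here is a rough sketch of how such a baseline could be evaluated. The function names, the FIT-per-Mbit rate, and the budget are all hypothetical placeholders, not figures from any real process or from the proposed spec. The only fixed fact is the definition of FIT as one failure per 10^9 device-hours.

```python
HOURS_PER_YEAR = 24 * 365   # 8760 hours
FIT_HOURS = 1e9             # 1 FIT = one failure per 10^9 device-hours


def failures_per_year(size_kib, fit_per_mbit):
    """Expected soft-error failures per year for an unprotected array.

    size_kib     -- array size in KiB
    fit_per_mbit -- assumed raw soft-error rate in FIT per Mbit (placeholder)
    """
    mbits = size_kib * 1024 * 8 / 1e6
    return mbits * fit_per_mbit * HOURS_PER_YEAR / FIT_HOURS


def needs_correction(size_kib, fit_per_mbit, budget_per_year):
    """True if the array alone would exceed the failures-per-year budget."""
    return failures_per_year(size_kib, fit_per_mbit) > budget_per_year


if __name__ == "__main__":
    # Hypothetical inputs: 1000 FIT/Mbit and a budget of 0.01 failures/year.
    for name, kib in (("32 KiB L1", 32), ("2 MiB L2", 2048)):
        print(name, needs_correction(kib, 1000, 0.01))
```

With a rule of this shape, a small structure can fall under the budget with detection only, while a larger one at the same raw rate cannot.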

The argument will be: why are you requiring ECC correction on this, and not on the register file or CSRs?
The reason is that they're small enough that failures are unlikely - and that is how your rationale should be stated.
There will be platforms that are much more demanding (safety critical) where duplication or majority voting is required.
I didn't think that we were talking about those application areas.

On Thu, Jun 17, 2021 at 8:56 AM Abner Chang <renba.chang@...> wrote:

On Wed, Jun 16, 2021 at 8:17 AM Kumar Sankaran <ksankaran@...> wrote:
Signed-off-by: Kumar Sankaran <ksankaran@...>
 riscv-platform-spec.adoc | 42 ++++++++++++++++++++++++++--------------
 1 file changed, 27 insertions(+), 15 deletions(-)

diff --git a/riscv-platform-spec.adoc b/riscv-platform-spec.adoc
index 4c356b8..d779452 100644
--- a/riscv-platform-spec.adoc
+++ b/riscv-platform-spec.adoc
@@ -19,18 +19,6 @@
 // table of contents

-// document copyright and licensing information
-// changelog for the document
-// Introduction: describe the intent and purpose of the document
-// Profiles: (NB: content from very first version)
 == Introduction
 The platform specification defines a set of platforms that specify requirements
 for interoperability between software and hardware. The platform policy
@@ -68,11 +56,13 @@ The M platform has the following extensions:
 |SBI       | Supervisor Binary Interface
 |UEFI      | Unified Extensible Firmware Interface
 |ACPI      | Advanced Configuration and Power Interface
+|APEI      | ACPI Platform Error Interfaces
 |SMBIOS    | System Management Basic I/O System
 |DTS       | Devicetree source file
 |DTB       | Devicetree binary
 |RVA22     | RISC-V Application 2022
 |EE        | Execution Environment
+|OSPM      | Operating System Power Management
 |RV32GC    | RISC-V 32-bit general purpose ISA described as RV32IMAFDC.
 |RV64GC    | RISC-V 64-bit general purpose ISA described as RV64IMAFDC.
@@ -87,6 +77,7 @@ The M platform has the following extensions:
 |link:[RVA22 Specification]
                                        | TBD
 |link:[EBBR Specification]
                                        | v2.0.0-pre1
Specification]              | v6.4
Specification]              | v6.4
Specification]    | v3.4.0
 |link:[Platform Policy]
                                        | TBD
@@ -504,6 +495,30 @@ delegate the virtual supervisor timer interrupt
to 'VS' mode.

 ==== RAS
+All the below mentioned RAS features are required for the OS-A platform server
+*  Main memory must be protected with SECDED-ECC +
+*  All cache structures must be protected +
+** single-bit errors must be detected and corrected +
+** multi-bit errors can be detected and reported +
+* There must be memory-mapped RAS registers associated with these protected
+structures to log detected errors with information about the type and location
+of the error +
+* The platform must support the APEI specification to convey all error
+information to OSPM +
+* Correctable errors must be reported by hardware and either be corrected or
+recovered by hardware, transparent to system operation and to software +
+* Hardware must provide status of these correctable errors via RAS registers +
+* Uncorrectable errors must be reported by the hardware via RAS error
+registers for system software to take the needed corrective action +
+* Attempted use of corrupted (uncorrectable) data must result in a precise
+exception on that instruction with a distinguishing custom exception cause
+code +
+* Errors logged in RAS registers must be able to generate an interrupt request
+to the system interrupt controller that may be directed to either M-mode or
+S/HS-mode for firmware-first versus OS-first error reporting +
+* PCIe AER capability is required +

Hi Kumar,
I would like to add something.
In order to support the OEM RAS policy,
- The platform should provide the capability to configure each RAS error to trigger either a firmware-first or an OS-first error interrupt.
- If the RAS error is handled by firmware, the firmware should be able to choose whether to expose the error to S/HS mode for further processing or to hide it from S/HS software. This requires a mechanism provided by the platform, and that mechanism should be protected by M-mode.
- It should be possible to mask each RAS error through RAS configuration registers.
- We should also consider delivering the RAS error interrupt to the TEE, which is where the firmware management mode resides.

- It should be possible to convert the baseline PCIe error or AER interrupt into a firmware-first interrupt before it is delivered to S/HS software. This gives firmware a chance to log the error, correct the error, or hide the error from S/HS software according to the OEM RAS policy.
Besides memory and PCIe RAS, do we have RAS errors for the processor/HART, such as IPI errors or CE/UC/UCR errors local to the HART?


 // M Platform
 == M Platform
@@ -593,6 +608,3 @@ also implement PMP support.
 When PMP is supported it is recommended to include at least 4 regions, although
 if possible more should be supported to allow more flexibility. Hardware
 implementations should aim for supporting at least 16 PMP regions.
-// acknowledge all of the contributors
