Re: [PATCH 1/1] RAS features for OS-A platform server extension

Abner Chang

On Wed, Jun 16, 2021 at 8:17 AM, Kumar Sankaran <ksankaran@...> wrote:
Signed-off-by: Kumar Sankaran <ksankaran@...>
 riscv-platform-spec.adoc | 42 ++++++++++++++++++++++++++--------------
 1 file changed, 27 insertions(+), 15 deletions(-)

diff --git a/riscv-platform-spec.adoc b/riscv-platform-spec.adoc
index 4c356b8..d779452 100644
--- a/riscv-platform-spec.adoc
+++ b/riscv-platform-spec.adoc
@@ -19,18 +19,6 @@
 // table of contents

-// document copyright and licensing information
-// changelog for the document
-// Introduction: describe the intent and purpose of the document
-// Profiles: (NB: content from very first version)
 == Introduction
 The platform specification defines a set of platforms that specify requirements
 for interoperability between software and hardware. The platform policy
@@ -68,11 +56,13 @@ The M platform has the following extensions:
 |SBI       | Supervisor Binary Interface
 |UEFI      | Unified Extensible Firmware Interface
 |ACPI      | Advanced Configuration and Power Interface
+|APEI      | ACPI Platform Error Interfaces
 |SMBIOS    | System Management Basic I/O System
 |DTS       | Devicetree source file
 |DTB       | Devicetree binary
 |RVA22     | RISC-V Application 2022
 |EE        | Execution Environment
+|OSPM      | Operating System Power Management
 |RV32GC    | RISC-V 32-bit general purpose ISA described as RV32IMAFDC.
 |RV64GC    | RISC-V 64-bit general purpose ISA described as RV64IMAFDC.
@@ -87,6 +77,7 @@ The M platform has the following extensions:
 |link:[RVA22 Specification]
                                        | TBD
 |link:[EBBR Specification]
                                        | v2.0.0-pre1
Specification]              | v6.4
Specification]              | v6.4
Specification]    | v3.4.0
 |link:[Platform Policy]
                                        | TBD
@@ -504,6 +495,30 @@ delegate the virtual supervisor timer interrupt
to 'VS' mode.

 ==== RAS
+All the below mentioned RAS features are required for the OS-A platform server
+*  Main memory must be protected with SECDED-ECC +
+*  All cache structures must be protected +
+** single-bit errors must be detected and corrected +
+** multi-bit errors can be detected and reported +
+* There must be memory-mapped RAS registers associated with these protected
+structures to log detected errors with information about the type and location
+of the error +
+* The platform must support the APEI specification to convey all error
+information to OSPM +
+* Correctable errors must be reported by hardware and either be corrected or
+recovered by hardware, transparent to system operation and to software +
+* Hardware must provide status of these correctable errors via RAS registers +
+* Uncorrectable errors must be reported by the hardware via RAS error
+registers for system software to take the needed corrective action +
+* Attempted use of corrupted (uncorrectable) data must result in a precise
+exception on that instruction with a distinguishing custom exception cause
+code +
+* Errors logged in RAS registers must be able to generate an interrupt request
+to the system interrupt controller that may be directed to either M-mode or
+S/HS-mode for firmware-first versus OS-first error reporting +
+* PCIe AER capability is required +

Hi Kumar,
I would like to add something.
To support OEM RAS policy:
- The platform should provide the capability to configure each RAS error to trigger either a firmware-first or an OS-first error interrupt.
- If a RAS error is handled by firmware, the firmware should be able to choose whether to expose the error to S/HS-mode for further processing or to hide it from S/HS software. This requires a platform-provided mechanism, and that mechanism should be protected by M-mode.
- It should be possible to mask each RAS error through RAS configuration registers.
- We should also consider delivering the RAS error interrupt to the TEE, which is where the firmware management mode resides.

- It should be possible to morph the baseline PCIe error or AER interrupt into a firmware-first interrupt before it is delivered to S/HS software. This gives firmware a chance to log the error, correct it, or hide it from S/HS software according to OEM RAS policy.
Besides memory and PCIe RAS, do we have RAS errors for the processor/HART, such as IPI errors or CE/UC/UCR errors local to a HART?
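
To make the policy points above concrete, here is a minimal sketch of how per-error routing, masking, and hiding could interact. It is only an illustrative software model, not a proposed register layout: the names `Route`, `ErrorPolicy`, `deliver`, and the error source labels are all hypothetical, and defaulting an unconfigured source to OS-first is an assumption made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Route(Enum):
    FIRMWARE_FIRST = "M-mode"   # interrupt is taken by firmware first
    OS_FIRST = "S/HS-mode"      # interrupt goes directly to the OS


@dataclass
class ErrorPolicy:
    """Hypothetical per-error-source policy, assumed writable only by M-mode."""
    route: Route = Route.OS_FIRST   # assumption: OS-first when unconfigured
    masked: bool = False            # a masked error raises no interrupt
    expose_to_os: bool = True       # firmware-first only: forward after handling?


def deliver(source: str, policies: Dict[str, ErrorPolicy]) -> List[str]:
    """Return, in order, the privilege modes that observe an error from `source`."""
    policy = policies.get(source, ErrorPolicy())
    if policy.masked:
        return []
    if policy.route is Route.OS_FIRST:
        return ["S/HS-mode"]
    # Firmware-first: M-mode logs or corrects, then optionally forwards.
    seen = ["M-mode"]
    if policy.expose_to_os:
        seen.append("S/HS-mode")
    return seen


# Example OEM policy (source names are made up for illustration).
policies = {
    "dram_ce": ErrorPolicy(route=Route.FIRMWARE_FIRST, expose_to_os=False),
    "pcie_aer": ErrorPolicy(route=Route.FIRMWARE_FIRST, expose_to_os=True),
    "l2_cache_ce": ErrorPolicy(masked=True),
}

for name in ("dram_ce", "pcie_aer", "l2_cache_ce", "unconfigured"):
    print(name, "->", deliver(name, policies))
```

In this model a firmware-first PCIe AER error is seen by M-mode and then forwarded, a hidden DRAM correctable error stops at M-mode, and a masked source reaches no one. Delivery to a TEE is not modeled here.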


 // M Platform
 == M Platform
@@ -593,6 +608,3 @@ also implement PMP support.
 When PMP is supported it is recommended to include at least 4 regions, although
 if possible more should be supported to allow more flexibility. Hardware
 implementations should aim for supporting at least 16 PMP regions.
-// acknowledge all of the contributors
