Re: [PATCH 1/1] RAS features for OS-A platform server extension

Abner Chang

Greg Favor <gfavor@...> wrote on Wed, Jun 23, 2021 at 9:51 AM:
On Tue, Jun 22, 2021 at 5:34 PM Kumar Sankaran <ksankaran@...> wrote:
I think the primary requirements here are the following:
- The platform should provide the capability to configure each RAS
error to trigger a firmware-first or OS-first error interrupt.

Yes.  Which is just a software matter of configuring the interrupt controller accordingly.
Does this mean the interrupt controller would integrate all RAS events (hart, PCI, I/O, memory, etc.)?
Or would there be a separate hardware block that manages all RAS error events, with some error signals output from that block and connected to the interrupt controller? In that case, would the interrupt controller just provide the mechanism to turn those error signals into firmware-first (FFM) or OS-first (OSF) interrupts?
- If the RAS error is handled by firmware, the firmware should be able
to choose either to expose the error to S/HS mode for further processing
or to hide the error from S/HS software.
Is there a need to provide all the other details?

Agreed.  The details and mechanics don't need to be discussed (unless they are mandating specific mechanics - which I don't believe is the case). 

> Yes, to mask the RAS error interrupt, or even to skip creating the log (in RAS status registers or CSRs), for errors that the OEM doesn't consider useful or important to the product.

This is fine

Maybe just say that "Logging and/or reporting of errors can be masked".

Can we summarize the requirement as
- RAS errors should be capable of interrupting TEE.
This is OK for now because there is no hardware signal defined for triggering the TEE, right? I have more comments on this below.

This implies a requirement to have a TEE - and defining what constitutes a compliant TEE in the platform spec.  Btw, what distinguishes the TEE from "firmware"?
Please correct me on the ARM part if I am wrong.
The equivalent mechanism to TEE is SMM on X86 and TZ on ARM. I don't quite understand how ARM TZ works; however, on an X86 system, all cores are brought into the SMM environment when an SMI is triggered. ARM has an equivalent event, which is SMC, right?
The above is called management mode (MM), which is defined in the UEFI PI spec. MM runs at the highest privilege: above ring 0 on X86 and in the secure world entered through EL3 on ARM. MM is OS-agnostic: the MM event halts whatever is running and brings the core into management mode to run firmware code. The MM environment (data and code) can only be accessed while the core is in MM. Firmware typically uses this for security functions, power management, and of course RAS.

I would like to add one more thing to the RAS requirement, but I don't know how to describe it properly because it seems we don't have an MM event on RISC-V, such as SMI or SMC, that can bring the system into MM. So there are two scenarios for RAS in the firmware-first model.
- If the platform doesn't have a TEE and a hardware event to trigger it:
  If the RAS event is configured as firmware-first, the platform should be able to trigger an M-mode exception on all harts in the physical processor. This prevents subsequent RAS errors from being propagated by other harts that access the problematic hardware (PCI, memory, etc.).

- If the platform has a TEE and a hardware event to trigger it:
  If the RAS event is configured as firmware-first, the platform should be able to trigger a TEE event on all harts in the physical processor and bring all harts into the TEE. This prevents subsequent RAS errors from being propagated by other harts that access the problematic hardware (PCI, memory, etc.).
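
To make the two scenarios above concrete, here is a rough Python model of the intended behavior. It is only a sketch of the idea, not a proposal for the spec text: the names `Target`, `firmware_first_targets`, and `has_tee_trigger` are hypothetical, and real hardware would of course do this with signals rather than a dictionary.

```python
from enum import Enum


class Target(Enum):
    """Where a hart is taken when a firmware-first RAS event fires (hypothetical names)."""
    M_MODE = "M-mode exception"
    TEE = "TEE event"


def firmware_first_targets(hart_ids, has_tee_trigger):
    """Return the event delivered to each hart for a firmware-first RAS error.

    Assumption: `has_tee_trigger` is True only when the platform has both a TEE
    and a hardware event that can bring harts into it.
    """
    target = Target.TEE if has_tee_trigger else Target.M_MODE
    # Every hart in the physical processor is signalled, not only the hart that
    # detected the error, so no other hart keeps touching the faulty hardware.
    return {hart: target for hart in hart_ids}


no_tee = firmware_first_targets(range(4), has_tee_trigger=False)
with_tee = firmware_first_targets(range(4), has_tee_trigger=True)
```

The point of the model is only that the event is broadcast: both results cover all four harts, and the only difference between the scenarios is the destination.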

PCIe AER errors have been handled OS first on X86 systems. If I
recall correctly, ARM64 initially made PCIe AER errors firmware first
and then later changed to OS first to be compliant with what's already
out there.
The exact manner of handling these PCIe AER errors is also OEM
dependent. Some OEMs will handle it OS first while making a call to
the firmware to take additional corrective action of notifying the BMC
and such. Some ARM64 implementations handle this firmware first and
notify the BMC and then notify the OS.
From a RISC-V platforms requirements perspective, my suggestion is we
simply mention the capability of all errors to have support for
firmware first and OS first and leave it at that.
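
As an illustration of what "capability of all errors to support firmware first and OS first" could look like from the software side, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `RasRouting` class, the error-source names, and the firmware-first default are hypothetical and not taken from any spec.

```python
from enum import Enum


class Handling(Enum):
    FIRMWARE_FIRST = "firmware-first"
    OS_FIRST = "os-first"


class RasRouting:
    """Per-error-source routing table (hypothetical model, not a real API)."""

    def __init__(self, default=Handling.FIRMWARE_FIRST):
        # Assumption: unconfigured sources fall back to a platform default.
        self._default = default
        self._per_error = {}

    def configure(self, error_source, handling):
        """Select firmware-first or OS-first handling for one error source."""
        self._per_error[error_source] = handling

    def route(self, error_source):
        """Return how an error from this source would be delivered."""
        return self._per_error.get(error_source, self._default)


routing = RasRouting()
# An OEM that wants PCIe AER handled by the OS, as on X86, overrides one entry.
routing.configure("pcie_aer", Handling.OS_FIRST)
```

With this table, `routing.route("pcie_aer")` gives OS-first while an unconfigured source such as `"dram_ecc"` stays firmware-first, which is the kind of OEM-dependent choice described above.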

Agreed all around.


