Re: [PATCH 1/1] RAS features for OS-A platform server extension

Greg Favor

On Thu, Jun 17, 2021 at 8:56 AM Abner Chang <renba.chang@...> wrote:
- The platform should provide the capability to configure each RAS error to trigger either a firmware-first or an OS-first error interrupt.
- If the RAS error is handled by firmware, the firmware should be able to choose whether to expose the error to S/HS mode for further processing or to hide it from S/HS software. This requires some mechanism provided by the platform, and that mechanism should be protected by M-mode.

I would have thought that this is just a software issue.  What kind of hardware mechanism do you picture being needed?
- Each RAS error should be maskable through RAS configuration registers.

By "mask" do you mean masking of generation of an error interrupt?
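
For illustration only, the two possible readings of "mask" (suppressing the interrupt versus suppressing the error record itself) can be pictured with a small C sketch. Every name and bit position below is hypothetical; none of it comes from the patch or from any RISC-V specification. It also assumes, purely for the sake of the example, that firmware-first versus OS-first routing is a per-error control bit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-error-source control bits (assumed, not from any spec). */
#define RAS_CTL_LOG_EN   (1u << 0) /* record the error in a status/log register */
#define RAS_CTL_IRQ_EN   (1u << 1) /* raise an interrupt when the error is logged */
#define RAS_CTL_FW_FIRST (1u << 2) /* 1: interrupt goes to M-mode firmware, 0: to the S/HS-mode OS */

enum ras_target { RAS_TARGET_NONE, RAS_TARGET_FIRMWARE, RAS_TARGET_OS };

/* Returns true if a detected error would still be recorded. */
static bool ras_error_is_logged(uint32_t ctl)
{
    return (ctl & RAS_CTL_LOG_EN) != 0;
}

/* Returns where, if anywhere, the interrupt for a detected error is delivered.
 * An error that is not logged, or whose interrupt is masked, signals nobody. */
static enum ras_target ras_irq_target(uint32_t ctl)
{
    if (!(ctl & RAS_CTL_LOG_EN) || !(ctl & RAS_CTL_IRQ_EN))
        return RAS_TARGET_NONE;
    return (ctl & RAS_CTL_FW_FIRST) ? RAS_TARGET_FIRMWARE : RAS_TARGET_OS;
}
```

Under these assumptions, clearing only `RAS_CTL_IRQ_EN` masks the interrupt while the error is still logged and can be polled, whereas clearing `RAS_CTL_LOG_EN` hides the error entirely.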
- We should also consider delivering the RAS error interrupt to the TEE, which is where the firmware management mode resides.

Wouldn't the TEE be running in M-mode?  Or where is it expected to be running?
- The baseline PCIe error or AER interrupt can be morphed into a firmware-first interrupt before being delivered to S/HS software. This gives firmware a chance to log the error, correct the error, or hide the error from S/HS software according to the OEM RAS policy.

In x86 and ARM platforms, doesn't the OS pretty much always handle PCIe AER errors (i.e. OS-first for this class of errors)?  (I was reading an Intel overview doc recently that essentially said that, irrespective of whether other classes of errors are OS-first or firmware-first.)

Besides memory and PCIe RAS, do we have RAS errors for the processor/HART, such as IPI errors or some CE/UC/UCR errors local to the HART?

Definitely there will be processor/hart errors.  Presumably each hart would output one or more RAS interrupt request signals.

