Re: [PATCH 1/1] RAS features for OS-A platform server extension

Greg Favor

On Fri, Jun 18, 2021 at 9:01 AM Abner Chang <renba.chang@...> wrote:
On Fri, Jun 18, 2021 at 2:03 AM Greg Favor <gfavor@...> wrote:
On Thu, Jun 17, 2021 at 8:56 AM Abner Chang <renba.chang@...> wrote:
- The platform should provide the capability to configure each RAS error to trigger either a firmware-first or an OS-first error interrupt.
- If the RAS error is handled by firmware, the firmware should be able to choose whether to expose the error to S/HS mode for further processing or to hide it from S/HS software. This requires a mechanism provided by the platform, and that mechanism should be protected by M-mode.

I would have thought that this is just a software issue.  What kind of hardware mechanism do you picture being needed?
That could be:
- If the RAS error triggers M-mode (firmware-first model, FFM) and firmware decides to expose the error to the OS (which could be configured through a CSR or RAS registers), then the RAS OS interrupt can be triggered when the system exits M-mode.
- Or, if the RAS error triggers Management mode in the TEE, then the RAS OS interrupt can be triggered when the system exits the TEE.
The knob for exposing RAS errors to the OS could live in each RAS error configuration register, or in one centralized RAS register or CSR covering all RAS errors.
Suppose the event that brings the system into the TEE has the highest priority, even when the system is executing in M-mode. This ensures firmware can address the RAS error immediately, whichever privilege level it occurs in.
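
As a rough illustration of the per-error knob described above, here is a minimal C sketch. All names in it (ras_error_cfg, ras_route, expose_to_os, and both helper functions) are hypothetical and are not taken from any RISC-V specification or existing firmware; it only models the decision logic, not a real register layout.

```c
#include <stdbool.h>

/* Hypothetical per-error routing policy; not from any RISC-V spec. */
enum ras_route {
    RAS_ROUTE_OS_FIRST,       /* error interrupt goes straight to S/HS mode */
    RAS_ROUTE_FIRMWARE_FIRST  /* error interrupt is taken by M-mode firmware first */
};

/* Hypothetical per-error configuration, writable only from M-mode. */
struct ras_error_cfg {
    enum ras_route route;
    bool expose_to_os;  /* the "knob": forward to the OS after firmware handling */
};

enum ras_target { RAS_TARGET_M_MODE, RAS_TARGET_S_MODE };

/* Which privilege level takes the interrupt when the error is first raised. */
static enum ras_target ras_initial_target(const struct ras_error_cfg *cfg)
{
    return cfg->route == RAS_ROUTE_FIRMWARE_FIRST ? RAS_TARGET_M_MODE
                                                  : RAS_TARGET_S_MODE;
}

/* Whether a RAS OS interrupt should be pended when firmware exits M-mode.
 * Only meaningful for firmware-first errors that firmware chose to expose. */
static bool ras_pend_os_irq_on_mmode_exit(const struct ras_error_cfg *cfg)
{
    return cfg->route == RAS_ROUTE_FIRMWARE_FIRST && cfg->expose_to_os;
}
```

With one such entry per error source this corresponds to the "per-error configuration register" option; a single shared entry would correspond to the centralized register or CSR option.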

Thanks.  This does seem to be all a matter of software configuring and handling things appropriately.

- We should also consider delivering the RAS error interrupt to the TEE, which is where the firmware management mode resides.

Wouldn't the TEE be running in M-mode?  Or where is it expected to be running?
Yes, the TEE runs in M-mode, if memory serves me right from the spec. My expectation is that an event, triggered by either hardware or software, can bring the system into the TEE no matter which mode the hart is currently running in. I am not sure whether this is how the TEE would be implemented.

Then this just becomes a matter of software configuring the interrupt controller to direct a given interrupt source to a given privilege mode.

- The baseline PCIe error or AER interrupt can be morphed into a firmware-first interrupt before it is delivered to S/HS software. This gives firmware a chance to log the error, correct it, or hide it from S/HS software according to the OEM RAS policy.

In x86 and ARM platforms, doesn't the OS pretty much always handle PCIe AER errors (i.e. OS-first for this class of errors)?  (I was reading an Intel overview doc recently that essentially said that, irrespective of whether other classes of errors are OS-first or firmware-first.)
Besides correcting the error in firmware, firmware also logs the necessary PCIe error events to the BMC before the OS handles them. The firmware RAS logs can be retrieved out-of-band even when the system is shut down or the OS has crashed. This improves diagnosability and reduces the cost of customer service in the field.
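
A minimal C sketch of that firmware-first AER flow, under stated assumptions: the in-memory bmc_log stands in for a real out-of-band BMC channel, and oem_ras_policy, aer_severity, and firmware_first_aer are hypothetical names invented for illustration, not an existing firmware API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified severity classes for a PCIe AER event (illustrative only). */
enum aer_severity { AER_CORRECTABLE, AER_NONFATAL, AER_FATAL };

/* Hypothetical OEM policy: which errors firmware hides from S/HS software. */
struct oem_ras_policy {
    bool hide_correctable_from_os;
};

/* Stand-in for the BMC event log; a real platform would send these
 * records over an out-of-band interface instead of storing them here. */
#define BMC_LOG_CAPACITY 16
struct bmc_log {
    enum aer_severity entries[BMC_LOG_CAPACITY];
    size_t count;
};

static void bmc_log_append(struct bmc_log *log, enum aer_severity sev)
{
    if (log->count < BMC_LOG_CAPACITY)
        log->entries[log->count++] = sev;
}

/* Firmware-first handling: always log before the OS sees anything, then
 * decide per policy whether to forward. Returns true if the error should
 * be delivered to S/HS software afterwards. */
static bool firmware_first_aer(struct bmc_log *log,
                               const struct oem_ras_policy *policy,
                               enum aer_severity sev)
{
    bmc_log_append(log, sev);
    if (sev == AER_CORRECTABLE && policy->hide_correctable_from_os)
        return false;
    return true;
}
```

The point of the ordering is that the log entry exists even if the OS later crashes while handling the forwarded error.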

Just fyi, this paper discusses use of both models in the x86 world: a-tour-beyond-bios-implementing-the-acpi-platform-error-interface-with-the-uefi.  As a number of us will remember from the ARMv8 days, there were big (as in religious) arguments over which model was the right one to adopt.  Ultimately it was accepted that both need to be supported by the architecture.  The point being that the OS-A platform spec should support both and not presume one as the one and only answer.
