Re: [PATCH 1/1] RAS features for OS-A platform server extension


Kumar Sankaran
 

Greg - Do you have any further comments/responses to Abner's comments below?
Abner - my comments inline below.

On Fri, Jun 18, 2021 at 9:01 AM Abner Chang <renba.chang@...> wrote:



Greg Favor <gfavor@...> 於 2021年6月18日 週五 上午2:03寫道:

On Thu, Jun 17, 2021 at 8:56 AM Abner Chang <renba.chang@...> wrote:

- The platform should provide the capability to configure each RAS error to trigger firmware-first or OS-first error interrupt.
- If the RAS error is handled by firmware, the firmware should be able to choose to expose the error to S/HS mode for further processes or just hide the error from S/HS software. This requires some mechanisms provided by the platform and the mechanism should be protected by M-mode.

I would have thought that this is just a software issue. What kind of hardware mechanism do you picture being needed?
That could be,
- If RAS error triggers M-mode (FFM) and firmware decides to expose the error to OS (could be configured through CSR or RAS registers), then the RAS OS interrupt can be triggered when the system exits M-mode.
- or If RAS error triggers Management mode in TEE, then the RAS OS interrupt to can be triggered when the system exits TEE.
The knob of exposing RAS errors to OS could go with each RAS error configuration register or just one centralized RAS register or CSR for all RAS errors.
Suppose the event to bring the system to TEE has the most priority even the system is executing in M-Mode. This makes sure firmware can address the RAS error immediately when it happens in any privilege.
I think the primary requirements here are the following:
- The platform should provide the capability to configure each RAS
error to trigger firmware-first or OS-first error interrupt.
- If the RAS error is handled by firmware, the firmware should be able
to choose to expose the error to S/HS mode for further processes or
just hide the error from S/HS software.
Is there a need to provide all the other details?




- Each RAS error should be able to mask through RAS configuration registers.

By "mask" do you mean masking of generation of an error interrupt?
Yes, to mask the RAS error interrupt or even not to create the log (in RAS status registers or CSR) that OEM doesn't consider that is a useful or important error to product.

This is fine


- We should also consider triggering RAS error interrupt to TEE which is where the firmware management mode resides.

Wouldn't the TEE be running in M-mode? Or where is it expected to be running?
yes,TEE is be running in M-mode if the memory serves me right from the spec. My expectation of TEE is there would be an event that can be triggered by either hardware or software to bring the system to TEE no matter which mode the HART is currently running, I am not sure if this is how TEE would be implemented.

Can we summarize the requirement to
- RAS errors should be capable of interrupting TEE.


For PCIe RAS,
- The baseline PCIe error or AER interrupt is able to be morphed to firmware-first interrupt before delivering to H/HS software. This gives firmware a chance to log the error, correct the error or hide the error from S/HS software according to OEM RAS policy.

In x86 and ARM platforms, doesn't the OS pretty much always handle PCIe AER errors (i.e. OS-first for this class of errors)? (I was reading an Intel overview doc recently that essentially said that - irrespective of whether other classes of errors are OS-first or firmware-first).)
Besides correcting the error in firmware, firmware also logs the necessary PCIe error events to BMC before OS handling that. The firmware RAS logs are retrieved in out-of-band even the system is shut down or the OS crashes. This increases the diagnosability and decreases the cost of customer service in the field.

Abner
The PCIe AER errors have been handled OS first on X86 systems. If I
recall correct, ARM64 initially made PCIe AER errors firmware first
and then later changed to OS first to be compliant with what's already
out there.
The exact manner of handling these PCIe AER errors is also OEM
dependent. Some OEMs will handle it OS first while making a call to
the firmware to take additional corrective action of notifying the BMC
and such. Some ARM64 implementations handle this firmware first and
notify the BMC and then notify the OS.
From a RISC-V platforms requirements perspective, my suggestion is we
simply mention the capability of all errors to have support for
firmware first and OS first and leave it at that.



Besides memory and PCIe RAS, do we have RAS errors for the processor/HART? such as IPI error or some CE/UC/UCR to HART locally?

Definitely there will be processor/hart errors. Presumably each hart would output one or more RAS interrupt request signals.

Greg
Yes, there will be more RAS errors. For the initial spec, we are only
making the bare minimal set of RAS features mandatory for the server
extension for 2022. We can add more RAS features as things solidify.

--
Regards
Kumar

Join tech-unixplatformspec@lists.riscv.org to automatically receive all group messages.