Re: [PATCH 1/1] RAS features for OS-A platform server extension

Kumar Sankaran

On Wed, Jun 23, 2021 at 9:00 AM Greg Favor <gfavor@...> wrote:

On Wed, Jun 23, 2021 at 7:59 AM Abner Chang <renba.chang@...> wrote:

Yes. Which is just a software matter of configuring the interrupt controller accordingly.
Does this mean the interrupt controller would integrate all RAS events (hart, PCI, I/O, memory, etc.)?
Or would there be a separate hardware block that manages all RAS error events, with some error signals output from that block and connected to the interrupt controller? In that case, would the interrupt controller just provide the mechanism to route those error signals as FFM (firmware-first) or OSF (OS-first) interrupts?

To the extent that "RAS interrupts" are literally that, i.e. interrupt request signals, then they go to the system interrupt controller just like all other interrupt request signals. (Some system designs might also have a "platform microcontroller" that has its own local interrupt controller and may receive some of these interrupt request signals.)

Maybe part of what you're trying to get at is that RAS error events in many architectures get logged in and reported from hardware RAS registers. RAS registers "report" errors by outputting RAS interrupt request signals. Software then comes back around and reads the RAS registers to gather info about logged errors.
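
To make the log-then-report flow above concrete, here is a minimal sketch in Python. It is only a model, not anything defined by the patch or by an existing RAS spec: the names RasBank, RasErrorRecord, log_error, and service_interrupt, the four-record bank size, and the address/syndrome fields are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RasErrorRecord:
    """One hypothetical RAS error-record register."""
    valid: bool = False
    address: int = 0
    syndrome: int = 0


@dataclass
class RasBank:
    """Hypothetical bank of RAS registers driving one interrupt request line."""
    records: list = field(
        default_factory=lambda: [RasErrorRecord() for _ in range(4)]
    )
    irq_pending: bool = False

    def log_error(self, address, syndrome):
        """Hardware side: latch the error in a free record and assert the IRQ."""
        for rec in self.records:
            if not rec.valid:
                rec.valid, rec.address, rec.syndrome = True, address, syndrome
                self.irq_pending = True
                return True
        return False  # all records in use; overflow handling is not modelled

    def service_interrupt(self):
        """Software side: read every logged error, clear it, deassert the IRQ."""
        drained = [(r.address, r.syndrome) for r in self.records if r.valid]
        for r in self.records:
            r.valid = False
        self.irq_pending = False
        return drained


bank = RasBank()
bank.log_error(0x8000_1000, 0x2)
bank.log_error(0x8000_2000, 0x5)
errors = bank.service_interrupt()
```

The point of the sketch is the split of roles: the registers hold the error details, the interrupt request signal only says "come and look", and whoever receives that interrupt (firmware or OS) reads the records afterwards.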

Can we summarize the requirement as:

- RAS errors should be capable of interrupting TEE.
This is OK for now because there is no hardware signal defined for triggering the TEE, right? I have more comments on this below.

I expect RV will have similarities to ARM in this matter - and ARM doesn't have a hardware signal defined for triggering TEE either (and hasn't felt the need to define such).

This implies a requirement to have a TEE - and defining what constitutes a compliant TEE in the platform spec. Btw, what distinguishes the TEE from "firmware"?
Please correct me on ARM part if I am wrong.
The equivalent mechanism to TEE is SMM on x86 and TZ on ARM. I don't quite understand how ARM TZ works; however, on an x86 system, all cores are brought into the SMM environment when an SMI is triggered. ARM has the equivalent event, which is SMC, right?

Neither ARM nor RISC-V has a direct equivalent of SMM. So I'll pick on what ARM has - which is rather like RV. At a hardware level ARM has EL3 and Secure ELx, and RV has M-mode and secure partitions of S/U-mode (using PMP). At a software level one has a Secure monitor running in EL3/M-mode, and it is tbd whether other parts run in SELx/partitions. TZ as a TEE is a combination of these hardware features and the secure software that runs on them. ARM TZ doesn't specify the actual software TEE; it just provides the hardware architectural features and framework for creating and running a TEE. There is no one standard ARM TEE (although ARM has developed their ATF as a reference secure boot flow, and maybe it has expanded in scope in recent years?).

In short, RV first needs to define, develop, and specify a software TEE. The hardware components are falling into place (e.g. PMP, ePMP, Zkr), and OpenSBI is working towards supporting secure partitions. So, until there is a concrete RISC-V TEE standard (or even a standard framework), we shouldn't be stating requirements tied to having a TEE. Also keep in mind that things like secure boot will be required in the Server extension - which is part of the overall topic of TEE.

The above is called management mode (MM), which is defined in the UEFI PI spec. MM has higher privilege than CR0 on x86 and EL3 on ARM. MM is OS agnostic: an MM event halts any running processes and brings the core into management mode to run the firmware code. The MM environment (data and code) can only be accessed when the core is in MM. Firmware always uses this for secure operations, power management, and of course RAS.

What you describe, for RV, is M-mode - a pretty direct analog of ARM EL3.

I would like to add one more thing to the RAS requirement, but I don't know how to describe it properly because it seems RISC-V doesn't have an MM event, such as SMI or SMC, that can bring the system into MM.

RV has ECALL, just like ARM has SMC.

So there are two scenarios for RAS in the firmware-first model.
- If the platform doesn't have TEE and the hardware event to trigger TEE:
If the RAS event is configured for firmware-first mode, the platform should be able to trigger an M-mode exception on all harts in the physical processor. This prevents subsequent RAS errors from being propagated by other harts that access the problematic hardware (PCI, memory, etc.).

- If the platform has TEE and the hardware event to trigger TEE:
If the RAS event is configured for firmware-first mode, the platform should be able to trigger a TEE event on all harts in the physical processor and bring all harts into the TEE. This prevents subsequent RAS errors from being propagated by other harts that access the problematic hardware (PCI, memory, etc.).
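
The two scenarios above can be sketched as a small Python model. This is only an illustration of the proposed behaviour, not platform-spec text: the names Mode, Routing, deliver_ras_event, and has_tee_trigger are hypothetical, and treating the OS-first case as "no hart is forced into firmware" is an assumption of the sketch.

```python
from enum import Enum


class Mode(Enum):
    S_MODE = "S"
    M_MODE = "M"
    TEE = "TEE"


class Routing(Enum):
    FIRMWARE_FIRST = "FFM"
    OS_FIRST = "OSF"


def deliver_ras_event(hart_modes, routing, has_tee_trigger):
    """Return the mode of each hart after a RAS event is delivered.

    hart_modes: list of Mode, one entry per hart in the physical processor.
    has_tee_trigger: True if the platform has a TEE and a hardware event
    that can bring harts into it (an assumption of this model).
    """
    if routing is Routing.OS_FIRST:
        # Assumed here: the OS handles the event; no hart is forced
        # into a firmware environment.
        return list(hart_modes)
    # Firmware-first: every hart is pulled into the firmware environment so
    # that no hart keeps accessing the problematic hardware.
    firmware_mode = Mode.TEE if has_tee_trigger else Mode.M_MODE
    return [firmware_mode for _ in hart_modes]


harts = [Mode.S_MODE] * 4
no_tee = deliver_ras_event(harts, Routing.FIRMWARE_FIRST, has_tee_trigger=False)
with_tee = deliver_ras_event(harts, Routing.FIRMWARE_FIRST, has_tee_trigger=True)
os_first = deliver_ras_event(harts, Routing.OS_FIRST, has_tee_trigger=True)
```

In both firmware-first cases the key property is the same: all harts leave their current context together, which is what keeps a second hart from touching the failing device or memory while firmware is handling the first error.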

I think part of what complicates this discussion is the nebulous nature of what exactly is the "TEE" in any given architecture. At a hardware level x86/ARM/RV have SMM/EL3/M-mode and they have ways to "call" into that secure environment. The software TEE architecture is what is rather nebulous. There isn't a standard software TEE architecture for x86; RV doesn't have something (yet), and ARM has just ATF (which one may or may not fully equate to being a "TEE").

Given where we are currently with the lack of a proper definition for
TEE, I suggest we simply remove the requirement for TEE for now and
add it later when the TEE spec is finalized.
Suggest we remove the line "RAS errors should be capable of
interrupting TEE" and leave it at that.

