Re: [PATCH 1/1] RAS features for OS-A platform server extension

Abner Chang

Greg Favor <gfavor@...> wrote on Thu, Jun 24, 2021 at 12:00 AM:
On Wed, Jun 23, 2021 at 7:59 AM Abner Chang <renba.chang@...> wrote:
Yes.  That is just a software matter of configuring the interrupt controller accordingly.
Does this mean the interrupt controller would integrate all RAS events (hart, PCI, I/O, memory, etc.)? 
Or would there be a separate hardware block that manages all RAS error events, with some error signals output from that block and connected to the interrupt controller? In that case, would the interrupt controller just provide the mechanism to turn those error signals into FFM or OSF interrupts?

To the extent that "RAS interrupts" are literally that, i.e. interrupt request signals, then they go to the system interrupt controller just like all other interrupt request signals.  (Some system designs might also have a "platform microcontroller" that has its own local interrupt controller and may receive some of these interrupt request signals.)
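As a rough illustration of "just configuring the interrupt controller": on a PLIC-style controller each hart typically has separate M-mode and S-mode interrupt contexts, and software enables each source per context. The Python model below is a hypothetical sketch, not a real driver; the source IDs and the `InterruptController` class are invented for illustration. Enabling a RAS source only in the M-mode contexts gives firmware-first handling; enabling it only in the S-mode contexts gives OS-first handling.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    M = "M"  # machine mode: firmware-first (FFM) handling
    S = "S"  # supervisor mode: OS-first (OSF) handling


@dataclass
class InterruptController:
    """Toy PLIC-like model: one enable set per (hart, privilege mode) context."""
    num_harts: int
    enables: dict = field(default_factory=dict)

    def __post_init__(self):
        for hart in range(self.num_harts):
            for mode in Mode:
                self.enables[(hart, mode)] = set()

    def route(self, source: int, mode: Mode) -> None:
        """Enable `source` only in the `mode` context of every hart."""
        for hart in range(self.num_harts):
            for m in Mode:
                self.enables[(hart, m)].discard(source)
            self.enables[(hart, mode)].add(source)

    def targets(self, source: int):
        """Contexts that would see `source` as pending when it is asserted."""
        return sorted(
            (hart, m.value)
            for (hart, m), srcs in self.enables.items()
            if source in srcs
        )


# Hypothetical source IDs for two RAS error signals.
RAS_MEM_ERROR_IRQ = 40
RAS_IO_ERROR_IRQ = 41

plic = InterruptController(num_harts=2)
plic.route(RAS_MEM_ERROR_IRQ, Mode.M)   # firmware-first
plic.route(RAS_IO_ERROR_IRQ, Mode.S)    # OS-first
print(plic.targets(RAS_MEM_ERROR_IRQ))  # [(0, 'M'), (1, 'M')]
print(plic.targets(RAS_IO_ERROR_IRQ))   # [(0, 'S'), (1, 'S')]
```

The point of the model is that the same physical request line can be made firmware-first or OS-first purely by which contexts software enables, with no change to the signal itself.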

Maybe part of what you're trying to get at is that RAS error events in many architectures get logged in and reported from hardware RAS registers.  RAS registers "report" errors by outputting RAS interrupt request signals.  Software then comes back around and reads the RAS registers to gather info about logged errors.
Yes, something like that.
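The log-then-read flow described above (hardware latches an error into a RAS register and raises an interrupt request; software later reads the registers back) can be sketched as a toy model. Everything here is hypothetical: the `ErrorRecord` fields, the `RasNode` class, and the clear-on-read behavior are assumptions chosen for illustration, not any specified RISC-V RAS register layout.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ErrorRecord:
    """Hypothetical RAS error record: valid bit, severity, faulting address."""
    valid: bool = False
    severity: Optional[str] = None
    address: Optional[int] = None


class RasNode:
    """Toy model of a hardware block that exposes RAS error record registers."""

    def __init__(self, num_records: int):
        self.records = [ErrorRecord() for _ in range(num_records)]
        self.irq_pending = False  # stands in for the RAS interrupt request signal

    def log_error(self, index: int, severity: str, address: int) -> None:
        # Hardware side: latch the error into a record and raise the interrupt.
        self.records[index] = ErrorRecord(True, severity, address)
        self.irq_pending = True


def ras_irq_handler(node: RasNode) -> List[ErrorRecord]:
    """Software side: read back every valid record, then clear it."""
    logged = []
    for i, rec in enumerate(node.records):
        if rec.valid:
            logged.append(rec)
            node.records[i] = ErrorRecord()  # clear so the record can log again
    node.irq_pending = False  # all records drained, so the request deasserts
    return logged


node = RasNode(num_records=4)
node.log_error(2, "CE", 0x8000_1000)
errors = ras_irq_handler(node)
print(len(errors), errors[0].severity, hex(errors[0].address))  # 1 CE 0x80001000
```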

Do we need to define the RAS error signals that are output to the interrupt controller? (The signals could be classified by error severity, such as CE, UC_FATAL, and UC_NONFATAL, or by RAS error category, such as RAS_MEM_ERROR, RAS_IO_ERROR, etc.)
I think we can just leave that to the RAS TG, because here we only define what the server platform needs for RAS, right?
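To make the two classification options concrete, here is a hypothetical sketch of how one error event would map onto an interrupt request line under each scheme. The severity and category names come from the question above; the `RAS_IRQ_*` line names and both mapping functions are invented for illustration, and the real choice would be left to the RAS TG.

```python
from enum import Enum


class Severity(Enum):
    CE = "corrected error"
    UC_NONFATAL = "uncorrected, non-fatal"
    UC_FATAL = "uncorrected, fatal"


class Category(Enum):
    RAS_MEM_ERROR = "memory"
    RAS_IO_ERROR = "I/O"


def signal_by_severity(category: Category, severity: Severity) -> str:
    # Option 1: one interrupt request line per severity; category is ignored.
    return f"RAS_IRQ_{severity.name}"


def signal_by_category(category: Category, severity: Severity) -> str:
    # Option 2: one interrupt request line per error category; severity is ignored.
    return f"RAS_IRQ_{category.name.split('_', 1)[1]}"


event = (Category.RAS_MEM_ERROR, Severity.UC_NONFATAL)
print(signal_by_severity(*event))  # RAS_IRQ_UC_NONFATAL
print(signal_by_category(*event))  # RAS_IRQ_MEM_ERROR
```

A severity-based split lets the platform route, say, fatal errors differently from corrected ones without decoding any registers first; a category-based split lets each subsystem's handler own its own line. Either way, the handler still reads the RAS registers for the details.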
Can we summarize the requirement as:
- RAS errors should be capable of interrupting the TEE.
This is OK for now because there is no hardware signal defined for triggering the TEE, right? I have more comments on this below. 

I expect RV will have similarities to ARM in this matter - and ARM doesn't have a hardware signal defined for triggering TEE either (and hasn't felt the need to define such).
OK, I thought there was a similar hardware signal.

Without a hardware signal to trigger the TEE, would the alternative be to trigger an M-mode exception and jump to the TEE from the M-mode exception handler?
So the scenarios for triggering the TEE would be:
- For the software management mode interface:
  S-mode -> SBI ecall to M-mode -> TEE jump vector -> TEE
- For the hardware management mode interface:
  Hardware interrupt -> M-mode handler -> TEE jump vector -> TEE
What firmware or software resides in the TEE is implementation-specific. For example, with edk2 we would load the management mode core into the TEE.
I am just trying to better understand the future design of the TEE on RV.
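Both entry paths above converge in the M-mode trap handler, which can be modeled roughly as follows. The `mcause` encoding used here (interrupt flag in the most significant bit, exception code 9 for an environment call from S-mode, interrupt code 11 for a machine external interrupt) is from the RISC-V privileged spec, and the SBI convention of passing the extension ID in `a7` is from the SBI spec. The extension ID value, `tee_jump_vector`, and the return strings are hypothetical placeholders; no existing SBI extension for entering a TEE is implied.

```python
# RISC-V mcause encoding (privileged spec): the top bit marks an interrupt.
XLEN = 64
INTERRUPT_BIT = 1 << (XLEN - 1)
ECALL_FROM_S_MODE = 9            # exception code: environment call from S-mode
MACHINE_EXTERNAL_INTERRUPT = 11  # interrupt code: machine external interrupt

# Hypothetical SBI extension ID for "enter management mode" requests.
HYPOTHETICAL_MM_EXTENSION_ID = 0x0A000000


def tee_jump_vector(reason: str) -> str:
    """Stand-in for the TEE entry point (e.g. an edk2 MM core)."""
    return f"TEE entered via {reason}"


def m_mode_trap_handler(mcause: int, a7: int = 0) -> str:
    is_interrupt = bool(mcause & INTERRUPT_BIT)
    code = mcause & ~INTERRUPT_BIT
    if not is_interrupt and code == ECALL_FROM_S_MODE and a7 == HYPOTHETICAL_MM_EXTENSION_ID:
        # Software MM interface: S-mode -> SBI ecall -> M-mode -> TEE
        return tee_jump_vector("SBI ecall")
    if is_interrupt and code == MACHINE_EXTERNAL_INTERRUPT:
        # Hardware MM interface: RAS interrupt -> M-mode handler -> TEE.
        # A real handler would first claim the source from the interrupt
        # controller and confirm it is a RAS source.
        return tee_jump_vector("hardware interrupt")
    return "handled in M-mode (not an MM request)"


print(m_mode_trap_handler(ECALL_FROM_S_MODE, a7=HYPOTHETICAL_MM_EXTENSION_ID))
print(m_mode_trap_handler(INTERRUPT_BIT | MACHINE_EXTERNAL_INTERRUPT))
```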


This implies a requirement to have a TEE - and defining what constitutes a compliant TEE in the platform spec.  Btw, what distinguishes the TEE from "firmware"?
Please correct me on the ARM part if I am wrong.
The mechanisms equivalent to a TEE are SMM on x86 and TZ on ARM. I don't quite understand how ARM TZ works; however, on an x86 system, all cores are brought into the SMM environment when an SMI is triggered. ARM has an equivalent event, which is SMC, right?

Neither ARM nor RISC-V has a direct equivalent of SMM.  So I'll focus on what ARM has, which is rather like RV.  At a hardware level ARM has EL3 and Secure ELx, and RV has M-mode and secure partitions of S/U-mode (using PMP).  At a software level one has a Secure Monitor running in EL3/M-mode, and it is TBD whether other parts run in SELx/partitions.  TZ as a TEE is a combination of these hardware features and the secure software that runs on them.  ARM TZ doesn't specify the actual software TEE; it just provides the hardware architectural features and framework for creating and running a TEE.  There is no one standard ARM TEE (although ARM has developed ATF as a reference secure boot flow, and maybe it has expanded in scope in recent years?).

In short, RV first needs to define, develop, and specify a software TEE.  The hardware components are falling into place (e.g. PMP, ePMP, Zkr), and OpenSBI is working towards supporting secure partitions.  So, until there is a concrete RISC-V TEE standard (or even a standard framework), we shouldn't be stating requirements tied to having a TEE.  Also keep in mind that things like secure boot will be required in the Server extension, which is part of the overall topic of TEE.
Thanks for the above explanation. 
The above is called management mode (MM), which is defined in the UEFI PI spec. MM has higher privilege than ring 0 on x86 and EL3 on ARM. MM is OS-agnostic: an MM event halts whatever is running and brings the core into management mode to run firmware code. The MM environment (data and code) can only be accessed while the core is in MM. Firmware uses this for security functions, power management, and of course RAS.

What you describe, for RV, is M-mode - a pretty direct analog of ARM EL3.

I would like to add one more thing to the RAS requirement, but I don't know how to describe it properly, because it seems RISC-V doesn't have an MM event, like SMI or SMC, that can bring the system into MM.

RV has ECALL, just like ARM has SMC.
Thanks for the correction. I thought SMC was a hardware signal.  
So there are two scenarios for RAS under the firmware-first model:
- If the platform has neither a TEE nor a hardware event to trigger the TEE:
  If the RAS event is configured for firmware-first mode, the platform should be able to trigger an M-mode exception on all harts in the physical processor. This prevents subsequent RAS errors from being propagated by other harts that access the problematic hardware (PCI, memory, etc.).

- If the platform has a TEE and a hardware event to trigger the TEE:
  If the RAS event is configured for firmware-first mode, the platform should be able to trigger a TEE event on all harts in the physical processor and bring all harts into the TEE. This prevents subsequent RAS errors from being propagated by other harts that access the problematic hardware (PCI, memory, etc.). 
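The "bring every hart in" behavior in both scenarios amounts to a rendezvous, which can be modeled with threads. In this hypothetical sketch each thread stands for a hart, a `threading.Event` stands in for an M-mode inter-processor interrupt (on RISC-V this would typically be a machine software interrupt), and a `threading.Barrier` whose action runs exactly once after all harts arrive stands in for the point where firmware may safely handle the error. The class and method names are invented; real firmware would also need a timeout for harts that never respond.

```python
import threading


class FirmwareFirstRendezvous:
    """Toy model: a RAS interrupt on one hart pulls every hart into M-mode."""

    def __init__(self, num_harts: int):
        self.num_harts = num_harts
        self.ipi = [threading.Event() for _ in range(num_harts)]  # per-hart IPI stand-in
        self.parked = set()
        self.lock = threading.Lock()
        self.handled_with = None
        # The barrier action runs once, after all harts have called wait().
        self.barrier = threading.Barrier(num_harts, action=self._handle_error)

    def _handle_error(self) -> None:
        # Every hart is parked here, so none can touch the faulty hardware.
        self.handled_with = len(self.parked)

    def _m_mode_handler(self, hart: int) -> None:
        with self.lock:
            self.parked.add(hart)
        self.barrier.wait()

    def hart_main(self, hart: int, takes_ras_irq: bool) -> None:
        if takes_ras_irq:
            for other in range(self.num_harts):
                if other != hart:
                    self.ipi[other].set()  # broadcast the M-mode exception
        else:
            self.ipi[hart].wait()          # run normally until the IPI arrives
        self._m_mode_handler(hart)


def simulate(num_harts: int = 4, origin: int = 0) -> int:
    fw = FirmwareFirstRendezvous(num_harts)
    threads = [
        threading.Thread(target=fw.hart_main, args=(h, h == origin))
        for h in range(num_harts)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return fw.handled_with


print(simulate(num_harts=4))  # 4
```

The same shape applies whether the harts end up in an M-mode handler or are then forwarded into a TEE; only the destination after the rendezvous changes.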

I think part of what complicates this discussion is the nebulous nature of what exactly is the "TEE" in any given architecture.  At a hardware level x86/ARM/RV have SMM/EL3/M-mode and they have ways to "call" into that secure environment.  The software TEE architecture is what is rather nebulous.  There isn't a standard software TEE architecture for x86; RV doesn't have one (yet), and ARM has just ATF (which one may or may not fully equate to being a "TEE").

