Platform specification questions


Ved Shanbhogue
 

Greetings All!

Please help with the following questions about the OS-A/server extension:
Section 2.1.4.1 - Timer support:
Should the ACLINT MTIMER support be optional, moved into the M-platform section, or made optional for the server extension?

Section 2.1.4.2.4:
"Software interrupts for M-mode, HS-mode and VS-mode are supported using AIA IMSIC devices". Is the term "software interrupt" here intended to be the minor identities 3, 1, and 2? If so please clarify what the support in IMSIC is expected to be for these minor identities.

Section 2.1.7.1:
Is supporting SBI TIME optional if the server extension is supported, since the server extension requires Sstc? Is supporting SBI IPI optional if the AIA IMSIC is supported?

Section 2.3.7.3.2 - PCIe memory space:
The requirement to not have any address translation for inbound accesses to any component in the system address space is restrictive. If direct assignment of devices is supported, then the IOMMU would be required to do the address translation for inbound accesses. Further, for hart-originated accesses where the PCIe memory is mapped into virtual address space, there needs to be a translation through the first- and/or second-level page tables. Please help clarify why PCIe memory must not be mapped into virtual address space and why use of the IOMMU to do translation is disallowed by the specification.

Section 2.3.7.3.3 - PCIe interrupts:
It seems unnecessary to require platforms built for the '22 version of the platform to support running software that is not MSI-aware. Please clarify why supporting INTx emulation for legacy/pre-PCIe software compatibility is a required and not an optional capability for RISC-V platforms.

Section 2.3.9:
For many platforms, especially server-class platforms, SECDED-ECC is not sufficient protection for main memory. Why is the specification restricting a platform to only support SECDED-ECC? Is it a violation of the platform specification if the platform supports stronger or more advanced ECC than SECDED-ECC?

Which caches are/should be protected, and how, is usually a function of the technology used by the cache, the operating conditions, and the SDC/DUE goals of the platform, and is established based on FIT modeling for that platform. Please clarify the rationale for requiring single-bit error correction on all caches. Also, please clarify why the spec doesn't allow for correcting multi-bit errors; based on the SDC goals, some caches may need to support, e.g., triple-error detection and double-error correction.

The firmware-first model seems to require a configuration per RAS event/RAS error source to trigger firmware-first or OS-first handling. There may be hundreds of RAS events/errors that a platform supports. Why is it required to support per-event/per-error selectivity vs. a two-level selection where all RAS errors are either handled firmware-first or handled OS-first?

regards
ved


Anup Patel
 

Hi Ved,

Please see comments inline below ...

Regards,
Anup

On Mon, Dec 13, 2021 at 5:45 AM Vedvyas Shanbhogue <ved@...> wrote:

Greetings All!

Please help with the following questions about the OS-A/server extension:
Section 2.1.4.1 - Timer support:
Should the ACLINT MTIMER support be optional, moved into the M-platform section, or made optional for the server extension?
The RISC-V Privileged spec v1.12 defines MTIME and MTIMECMP as
platform-specific memory-mapped registers in "Section 3.2 Machine-Level
Memory-Mapped Registers". This means the RISC-V platform specification
needs to standardize the memory layout and arrangement of the MTIME and
MTIMECMP memory-mapped registers, which is what the ACLINT MTIMER
specification does.
(Refer, https://github.com/riscv/riscv-isa-manual/releases/download/Priv-v1.12/riscv-privileged-20211203.pdf)

Since both OS-A platforms with M-mode and M platforms need the ACLINT
MTIMER, I suggest that OS-A platforms should say "If M-mode is
implemented then ACLINT MTIMER should be supported ...".
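
As a concrete (and purely illustrative) sketch of why the layout needs standardizing: once the MTIMER location is known, M-mode code reads the 64-bit MTIME register with the usual high/low/high sequence on RV32. The base address below is hypothetical; a real system discovers it from the device tree.

    #include <stdint.h>

    /* Hypothetical, CLINT-compatible MTIME address; the real address
     * comes from the device tree's ACLINT MTIMER node. */
    #define MTIME_BASE 0x0200BFF8UL

    static inline uint64_t read_mtime_rv32(void)
    {
        volatile uint32_t *lo = (volatile uint32_t *)MTIME_BASE;
        volatile uint32_t *hi = (volatile uint32_t *)(MTIME_BASE + 4);
        uint32_t h, l;

        /* Retry if the high word rolled over between the two 32-bit reads. */
        do {
            h = *hi;
            l = *lo;
        } while (*hi != h);

        return ((uint64_t)h << 32) | l;
    }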


Section 2.1.4.2.4:
"Software interrupts for M-mode, HS-mode and VS-mode are supported using AIA IMSIC devices". Is the term "software interrupt" here intended to be the minor identities 3, 1, and 2? If so please clarify what the support in IMSIC is expected to be for these minor identities.
The AIA IMSIC devices do not provide interrupts via minor identities
3, 1, and 2. Both AIA IMSIC and APLIC devices only deal with minor
identities 9, 10, and 11 (i.e. external interrupts), whereas the ACLINT
specification defines devices that deal with minor identities 1, 3,
and 7.

For software, the "inter-processor interrupts" (at M-mode or S-mode)
can be implemented:
1) Using ACLINT MSWI or SSWI devices
OR
2) Using AIA IMSIC devices

I think the confusion here is because the RISC-V platform
specification uses the term "software interrupt" for both
"inter-processor interrupt" and "minor identities 3, 1, and 2". I
suggest using the term "inter-processor interrupt" in most places and
only using the term "software interrupt" in the context of ACLINT MSWI
or SSWI devices.
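
To make the distinction concrete, here is a minimal sketch of both M-mode IPI mechanisms (all base addresses and the per-hart stride are hypothetical; a real system gets them from the device tree). Note that an IMSIC-based IPI arrives as a machine external interrupt (minor identity 11), not as minor identity 3:

    #include <stdint.h>

    #define MSWI_BASE    0x02000000UL  /* hypothetical ACLINT MSWI base */
    #define IMSIC_M_BASE 0x24000000UL  /* hypothetical M-level IMSIC base */
    #define IMSIC_STRIDE 0x1000UL      /* assumes one 4-KiB file per hart */
    #define IPI_ID       1             /* software-chosen interrupt identity */

    /* Option 1: ACLINT MSWI - raises minor identity 3 on the target hart. */
    static inline void send_ipi_mswi(unsigned long hart)
    {
        volatile uint32_t *msip = (volatile uint32_t *)MSWI_BASE;
        msip[hart] = 1;
    }

    /* Option 2: AIA IMSIC - writes the chosen identity to the target
     * hart's seteipnum_le register (offset 0 of its interrupt file). */
    static inline void send_ipi_imsic(unsigned long hart)
    {
        volatile uint32_t *seteipnum =
            (volatile uint32_t *)(IMSIC_M_BASE + hart * IMSIC_STRIDE);
        *seteipnum = IPI_ID;
    }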


Section 2.1.7.1:
Is supporting SBI TIME optional if the server extension is supported, since the server extension requires Sstc? Is supporting SBI IPI optional if the AIA IMSIC is supported?
I agree, this text needs to be improved because Base and Server are
now separate platforms. Since the Server platform mandates the IMSIC
and the Priv Sstc extension, SBI TIME, IPI, and RFENCE can be optional
there, but this is not true for the Base platform.
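
Either way, S-mode software can discover what it must rely on at runtime via the SBI Base extension's sbi_probe_extension call (EID 0x10, FID 3). A minimal sketch; the wrapper below is illustrative, not from any SBI library:

    #include <stdint.h>

    #define SBI_EXT_BASE       0x10
    #define SBI_BASE_PROBE_FID 3
    #define SBI_EXT_TIME       0x54494D45  /* "TIME" */
    #define SBI_EXT_IPI        0x00735049  /* "sPI" */

    /* Returns nonzero if the SEE implements the given SBI extension. */
    static long sbi_probe_extension(long extid)
    {
        register long a0 asm("a0") = extid;
        register long a6 asm("a6") = SBI_BASE_PROBE_FID;
        register long a7 asm("a7") = SBI_EXT_BASE;
        register long a1 asm("a1");

        asm volatile("ecall"
                     : "+r"(a0), "=r"(a1)
                     : "r"(a6), "r"(a7)
                     : "memory");
        return a1;  /* a0 holds the SBI error code, a1 the probe result */
    }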


Section 2.3.7.3.2 - PCIe memory space:
The requirement to not have any address translation for inbound accesses to any component in the system address space is restrictive. If direct assignment of devices is supported, then the IOMMU would be required to do the address translation for inbound accesses. Further, for hart-originated accesses where the PCIe memory is mapped into virtual address space, there needs to be a translation through the first- and/or second-level page tables. Please help clarify why PCIe memory must not be mapped into virtual address space and why use of the IOMMU to do translation is disallowed by the specification.

Section 2.3.7.3.3 - PCIe interrupts:
It seems unnecessary to require platforms built for the '22 version of the platform to support running software that is not MSI-aware. Please clarify why supporting INTx emulation for legacy/pre-PCIe software compatibility is a required and not an optional capability for RISC-V platforms.

Section 2.3.9:
For many platforms, especially server-class platforms, SECDED-ECC is not sufficient protection for main memory. Why is the specification restricting a platform to only support SECDED-ECC? Is it a violation of the platform specification if the platform supports stronger or more advanced ECC than SECDED-ECC?

Which caches are/should be protected, and how, is usually a function of the technology used by the cache, the operating conditions, and the SDC/DUE goals of the platform, and is established based on FIT modeling for that platform. Please clarify the rationale for requiring single-bit error correction on all caches. Also, please clarify why the spec doesn't allow for correcting multi-bit errors; based on the SDC goals, some caches may need to support, e.g., triple-error detection and double-error correction.

The firmware-first model seems to require a configuration per RAS event/RAS error source to trigger firmware-first or OS-first handling. There may be hundreds of RAS events/errors that a platform supports. Why is it required to support per-event/per-error selectivity vs. a two-level selection where all RAS errors are either handled firmware-first or handled OS-first?

regards
ved





Greg Favor
 

On Sun, Dec 12, 2021 at 7:55 PM Anup Patel <anup@...> wrote:
On Mon, Dec 13, 2021 at 5:45 AM Vedvyas Shanbhogue <ved@...> wrote:
> Please help with the following questions about the OS-A/server extension:
> Section 2.1.4.1 - Timer support:
> Should the ACLINT MTIMER support be optional, moved into the M-platform section, or made optional for the server extension?

The RISC-V Privileged spec v1.12 defines MTIME and MTIMECMP as
platform-specific memory-mapped registers in "Section 3.2 Machine-Level
Memory-Mapped Registers". This means the RISC-V platform specification
needs to standardize the memory layout and arrangement of the MTIME and
MTIMECMP memory-mapped registers, which is what the ACLINT MTIMER
specification does.
 
Since both OS-A platforms with M-mode and M platforms need the ACLINT
MTIMER, I suggest that OS-A platforms should say "If M-mode is
implemented then ACLINT MTIMER should be supported ...".

Here's a response from a different angle.  MTIME matters to the SEE because it provides the timebase that is then seen by all harts in their 'time' CSRs (via the RDTIME pseudoinstruction).  But if the initial OS-A platform specs are going to drop any M-mode standardization/etc., then it seems like the thing to do - from the SEE and OS-A platform perspectives - is to abstract MTIME as just the "system timebase that propagates to all harts and is seen by S/HS/U mode software in the form of the 'time' CSR" (just as the Unpriv spec does in its own words).

Whatever would be said about MTIME and tick period constraints (e.g. a minimum tick period) would instead be expressed wrt this abstracted timebase - which the Unpriv spec refers to as "wall-clock real time that has passed from an arbitrary start time in the past. .... The execution environment should provide a means of determining the period of a counter tick (seconds/tick).  ...".
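
For reference, the only hart-level primitive S/HS/U-mode software needs here is the time CSR (RV64 sketch below); the tick period itself comes from the SEE, e.g. the device tree's timebase-frequency property:

    #include <stdint.h>

    static inline uint64_t read_time(void)
    {
        uint64_t t;
        asm volatile("rdtime %0" : "=r"(t));  /* reads the 'time' CSR */
        return t;
    }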

This separates out from the current OS-A platform specs the ACLINT MTIMER device as a standardized Machine-level implementation of the MTIME and MTIMECMP registers defined in the Priv spec.

Now, for systems that implement Priv 1.12 and the Sstc extension, and actually use the Sstc extension, then this can be the end of the story.

But for today's systems and for future systems that don't implement Sstc (unless all OS-A 2022 platform specs were to mandate Sstc support and eliminate any possibility of existing systems complying with at least the Embedded (i.e. old "Base") OS-A platform spec), they also need the SBI API that provides Supervisor timer functionality to S/HS mode (with M-mode using MTIME and MTIMECMP to provide that functionality).  While this is also an SEE interface, talking about this does start to sneak up on talking about MTIME.  But then again one could still abstract MTIME as the system timebase, and MTIMECMP as a timebase compare value.
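
The two paths look like this from S-mode (a rough RV64 sketch; the SBI wrapper is illustrative, and 0x14D is the stimecmp CSR number defined by Sstc):

    #include <stdint.h>

    #define SBI_EXT_TIME      0x54494D45
    #define SBI_SET_TIMER_FID 0

    /* Path 1: no Sstc - ask the SEE; M-mode programs MTIMECMP for us. */
    static void set_timer_via_sbi(uint64_t next_tick)
    {
        register long a0 asm("a0") = (long)next_tick;
        register long a6 asm("a6") = SBI_SET_TIMER_FID;
        register long a7 asm("a7") = SBI_EXT_TIME;
        asm volatile("ecall" : "+r"(a0) : "r"(a6), "r"(a7) : "a1", "memory");
    }

    /* Path 2: Sstc - write the compare value directly, no SEE involved. */
    static void set_timer_via_sstc(uint64_t next_tick)
    {
        asm volatile("csrw 0x14D, %0" : : "r"(next_tick));
    }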

Greg


Ved Shanbhogue
 

On Sun, Dec 12, 2021 at 7:55 PM Anup Patel <anup@...> wrote:
Since both OS-A platforms with M-mode and M platforms need the ACLINT
MTIMER, I suggest that OS-A platforms should say "If M-mode is
implemented then ACLINT MTIMER should be supported ...".
I was thinking along the same lines as Greg here.


On Sun, Dec 12, 2021 at 08:43:30PM -0800, Greg Favor wrote:
This separates out from the current OS-A platform specs the ACLINT MTIMER
device as a standardized Machine-level implementation of the MTIME and
MTIMECMP registers defined in the Priv spec.

Now, for systems that implement Priv 1.12 and the Sstc extension, and
actually use the Sstc extension, then this can be the end of the story.
Agree.

But for today's systems and for future systems that don't implement Sstc
(unless all OS-A 2022 platform specs were to mandate Sstc support and
eliminate any possibility of existing systems complying with at least the
Embedded (i.e. old "Base") OS-A platform spec), they also need the SBI API
that provides Supervisor timer functionality to S/HS mode (with M-mode
using MTIME and MTIMECMP to provide that functionality). While this is
also an SEE interface, talking about this does start to sneak up on talking
about MTIME. But then again one could still abstract MTIME as the system
timebase, and MTIMECMP as a timebase compare value.
Agree.

regards
ved


Ved Shanbhogue
 

Hi Anup
On Mon, Dec 13, 2021 at 09:25:43AM +0530, Anup Patel wrote:

Section 2.1.4.2.4:
"Software interrupts for M-mode, HS-mode and VS-mode are supported using AIA IMSIC devices". Is the term "software interrupt" here intended to be the minor identities 3, 1, and 2? If so please clarify what the support in IMSIC is expected to be for these minor identities.
I think the confusion here is because the RISC-V platform
specification uses the term "software interrupt" for both
"inter-processor interrupt" and "minor identities 3, 1, and 2". I
suggest using the term "inter-processor interrupt" in most places and
only using the term "software interrupt" in the context of ACLINT MSWI
or SSWI devices.
Yes, that was my conclusion: "software interrupt" here was used to mean an IPI. I think clearing this up would be helpful.


Section 2.1.7.1:
Is supporting SBI TIME optional if the server extension is supported, since the server extension requires Sstc? Is supporting SBI IPI optional if the AIA IMSIC is supported?
I agree, this text needs to be improved because Base and Server are
now separate platforms. Since the Server platform mandates the IMSIC
and the Priv Sstc extension, SBI TIME, IPI, and RFENCE can be optional
there, but this is not true for the Base platform.
Yes; however, the Server is additive to the Base as written. Even for the Base, if Sstc and the IMSIC are supported, then SBI TIME, IPI, and RFENCE can be optional.

regards
ved


Anup Patel
 

On Mon, Dec 13, 2021 at 6:46 PM Ved Shanbhogue <ved@...> wrote:

Hi Anup
On Mon, Dec 13, 2021 at 09:25:43AM +0530, Anup Patel wrote:

Section 2.1.4.2.4:
"Software interrupts for M-mode, HS-mode and VS-mode are supported using AIA IMSIC devices". Is the term "software interrupt" here intended to be the minor identities 3, 1, and 2? If so please clarify what the support in IMSIC is expected to be for these minor identities.
I think the confusion here is because the RISC-V platform
specification uses the term "software interrupt" for both
"inter-processor interrupt" and "minor identities 3, 1, and 2". I
suggest using the term "inter-processor interrupt" in most places and
only using the term "software interrupt" in the context of ACLINT MSWI
or SSWI devices.
Yes, that was my conclusion: "software interrupt" here was used to mean an IPI. I think clearing this up would be helpful.


Section 2.1.7.1:
Is supporting SBI TIME optional if the server extension is supported, since the server extension requires Sstc? Is supporting SBI IPI optional if the AIA IMSIC is supported?
I agree, this text needs to be improved because Base and Server are
now separate platforms. Since the Server platform mandates the IMSIC
and the Priv Sstc extension, SBI TIME, IPI, and RFENCE can be optional
there, but this is not true for the Base platform.
Yes; however, the Server is additive to the Base as written. Even for the Base, if Sstc and the IMSIC are supported, then SBI TIME, IPI, and RFENCE can be optional.
The current text/organization is going to change (as discussed in
previous meetings). The Server platform will be a separate platform
independent of the Base platform because some of the requirements will
be different for the two platforms. (@Kumar/Atish please add if I missed
anything)

For the Base platform, I agree we can make SBI TIME, IPI, and RFENCE
mandatory only when the IMSIC and Sstc are not present. (@Atish do you
recall any other rationale in this context?)

Regards,
Anup




Kumar Sankaran
 

On Mon, Dec 13, 2021 at 8:30 AM Anup Patel <anup@...> wrote:

On Mon, Dec 13, 2021 at 6:46 PM Ved Shanbhogue <ved@...> wrote:

Hi Anup
On Mon, Dec 13, 2021 at 09:25:43AM +0530, Anup Patel wrote:

Section 2.1.4.2.4:
"Software interrupts for M-mode, HS-mode and VS-mode are supported using AIA IMSIC devices". Is the term "software interrupt" here intended to be the minor identities 3, 1, and 2? If so please clarify what the support in IMSIC is expected to be for these minor identities.
I think the confusion here is because the RISC-V platform
specification uses the term "software interrupt" for both
"inter-processor interrupt" and "minor identities 3, 1, and 2". I
suggest using the term "inter-processor interrupt" in most places and
only using the term "software interrupt" in the context of ACLINT MSWI
or SSWI devices.
Yes, that was my conclusion: "software interrupt" here was used to mean an IPI. I think clearing this up would be helpful.


Section 2.1.7.1:
Is supporting SBI TIME optional if the server extension is supported, since the server extension requires Sstc? Is supporting SBI IPI optional if the AIA IMSIC is supported?
I agree, this text needs to be improved because Base and Server are
now separate platforms. Since the Server platform mandates the IMSIC
and the Priv Sstc extension, SBI TIME, IPI, and RFENCE can be optional
there, but this is not true for the Base platform.
Yes; however, the Server is additive to the Base as written. Even for the Base, if Sstc and the IMSIC are supported, then SBI TIME, IPI, and RFENCE can be optional.
The current text/organization is going to change (as discussed in
previous meetings). The Server platform will be a separate platform
independent of the Base platform because some of the requirements will
be different for the two platforms. (@Kumar/Atish please add if I missed
anything)
Yes, as per our agreement during the Platform HSC meeting several
weeks back, the plan is to make OS-A Embedded and OS-A Server
individual platforms without any relationship to each other.
Common requirements between OS-A Embedded and OS-A Server will be put
into a new section called OS-A Common Requirements. This way, we can
have separate requirements for each platform independent of the other.
So OS-A Server will NOT be an extension of OS-A Embedded anymore
but a separate platform.

For the Base platform, I agree we can make SBI TIME, IPI, and RFENCE
mandatory only when the IMSIC and Sstc are not present. (@Atish do you
recall any other rationale in this context?)





--
Regards
Kumar


Kumar Sankaran
 

On Sun, Dec 12, 2021 at 4:15 PM Vedvyas Shanbhogue <ved@...> wrote:

Greetings All!

Please help with the following questions about the OS-A/server extension:
Section 2.1.4.1 - Timer support:
Should the ACLINT MTIMER support be optional, moved into the M-platform section, or made optional for the server extension?

Section 2.1.4.2.4:
"Software interrupts for M-mode, HS-mode and VS-mode are supported using AIA IMSIC devices". Is the term "software interrupt" here intended to be the minor identities 3, 1, and 2? If so please clarify what the support in IMSIC is expected to be for these minor identities.

Section 2.1.7.1:
Is supporting SBI TIME optional if the server extension is supported, since the server extension requires Sstc? Is supporting SBI IPI optional if the AIA IMSIC is supported?

Section 2.3.7.3.2 - PCIe memory space:
The requirement to not have any address translation for inbound accesses to any component in the system address space is restrictive. If direct assignment of devices is supported, then the IOMMU would be required to do the address translation for inbound accesses. Further, for hart-originated accesses where the PCIe memory is mapped into virtual address space, there needs to be a translation through the first- and/or second-level page tables. Please help clarify why PCIe memory must not be mapped into virtual address space and why use of the IOMMU to do translation is disallowed by the specification.

Section 2.3.7.3.3 - PCIe interrupts:
It seems unnecessary to require platforms built for the '22 version of the platform to support running software that is not MSI-aware. Please clarify why supporting INTx emulation for legacy/pre-PCIe software compatibility is a required and not an optional capability for RISC-V platforms.

Section 2.3.9:
For many platforms, especially server-class platforms, SECDED-ECC is not sufficient protection for main memory. Why is the specification restricting a platform to only support SECDED-ECC? Is it a violation of the platform specification if the platform supports stronger or more advanced ECC than SECDED-ECC?
Agree. The intent here was to mandate a minimal set of memory
protection features for server-class platforms. It is not a violation
of the platform spec to have something better. As per the discussions
within the group, the RAS specification is something that needs to be
taken up by a TG within RISC-V and driven to completion. At this point
in time, we don't have a RAS spec, while at the same time, we didn't
want to leave this topic completely off of the platform spec, especially
for servers.
Will wording this as "Main memory must be protected with SECDED-ECC at
the minimum or a stronger/advanced method of protection" suffice?
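
As background on the term: SECDED corrects any single-bit error and detects (but cannot correct) any double-bit error. Below is a toy sketch of the mechanics on 4 data bits - purely illustrative, not spec text; real main memory typically uses e.g. a (72,64) code, but the structure is the same.

    #include <stdint.h>

    static unsigned parity(unsigned v) { return __builtin_parity(v); }

    /* Hamming(7,4) in codeword bits [7:1], overall parity in bit 0. */
    static uint8_t secded_encode(uint8_t d)
    {
        unsigned d1 = d & 1, d2 = (d >> 1) & 1;
        unsigned d3 = (d >> 2) & 1, d4 = (d >> 3) & 1;
        uint8_t cw = ((d1 ^ d2 ^ d4) << 1) | ((d1 ^ d3 ^ d4) << 2) |
                     (d1 << 3) | ((d2 ^ d3 ^ d4) << 4) |
                     (d2 << 5) | (d3 << 6) | (d4 << 7);
        return cw | parity(cw);   /* overall parity enables double detect */
    }

    /* Returns 0 (clean, or single-bit error corrected) or -1 (double-bit
     * error: detected but uncorrectable). */
    static int secded_decode(uint8_t cw, uint8_t *d)
    {
        unsigned syn = parity(cw & 0xAA) | (parity(cw & 0xCC) << 1) |
                       (parity(cw & 0xF0) << 2);
        if (syn && !parity(cw))
            return -1;            /* even overall parity: two bits flipped */
        if (syn)
            cw ^= 1u << syn;      /* odd overall parity: flip bit 'syn' */
        *d = ((cw >> 3) & 1) | ((cw >> 4) & 2) |
             ((cw >> 4) & 4) | ((cw >> 4) & 8);
        return 0;
    }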


Which caches are/should be protected, and how, is usually a function of the technology used by the cache, the operating conditions, and the SDC/DUE goals of the platform, and is established based on FIT modeling for that platform. Please clarify the rationale for requiring single-bit error correction on all caches. Also, please clarify why the spec doesn't allow for correcting multi-bit errors; based on the SDC goals, some caches may need to support, e.g., triple-error detection and double-error correction.
The rationale was to have a minimal set of RAS requirements until we
have a proper RISC-V RAS spec that we can refer to. Hence, single-bit
error correction was a minimalistic requirement. It is not a platform
spec violation to have the ability to correct multi-bit errors.
The current wording is the following:
All cache structures must be protected.
Single-bit errors must be detected and corrected.
Multi-bit errors can be detected and reported.
Platforms are free to implement more advanced features than the
minimalistic requirements that are mandated here. So we should be OK.
Agree?


The firmware-first model seems to require a configuration per RAS event/RAS error source to trigger firmware-first or OS-first handling. There may be hundreds of RAS events/errors that a platform supports. Why is it required to support per-event/per-error selectivity vs. a two-level selection where all RAS errors are either handled firmware-first or handled OS-first?
Yes, there may be hundreds of RAS errors/events. The intent here was
that, at the lowest level of granularity, we should be able to
selectively route each of these to the respective software/firmware
entity. So yes, we could add additional gates on top, like the
two-level selection you have suggested, but the platform spec is simply
conveying the expected support at the lowest level.
The current wording is "The platform should provide the capability to
configure each RAS error to trigger a firmware-first or OS-first error
interrupt".
Will this suffice or do we need to add more clarity?
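
To make the lowest-level intent concrete, a purely hypothetical sketch follows; RISC-V has no standard RAS register layout today (which is exactly why a RAS TG is needed). Assume one memory-mapped config word per error source with a routing bit:

    #include <stdint.h>
    #include <stddef.h>

    #define RAS_CFG_BASE  0x30000000UL  /* hypothetical error-bank base */
    #define RAS_NUM_SRCS  256           /* hypothetical source count */
    #define RAS_ROUTE_FW  (1u << 0)     /* 1 = firmware-first, 0 = OS-first */

    /* Per-source selectivity; the two-level scheme you suggest would
     * collapse this into a single global mode bit. */
    static void ras_route(size_t src, int firmware_first)
    {
        volatile uint32_t *cfg = (volatile uint32_t *)RAS_CFG_BASE;
        if (firmware_first)
            cfg[src] |= RAS_ROUTE_FW;
        else
            cfg[src] &= ~RAS_ROUTE_FW;
    }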







--
Regards
Kumar


Ved Shanbhogue
 

On Mon, Dec 13, 2021 at 10:44:41AM -0800, Kumar Sankaran wrote:
Will wording this as "Main memory must be protected with SECDED-ECC at
the minimum or a stronger/advanced method of protection" suffice?
Thanks. Yes.

The current wording is the following:
All cache structures must be protected.
Single-bit errors must be detected and corrected.
Multi-bit errors can be detected and reported.
Platforms are free to implement more advanced features than the
minimalistic requirements that are mandated here. So we should be OK.
Agree?
Could I suggest:
"Cache structures must be protected to address the Failure-in-time (FIT) requirements. The protection mechanisms may included single-bit/multi-bit error detection and/or single/multi-bit error detection/correction schemes, replaying faulting instructions, lock-step execution, etc."


The current wording is "The platform should provide the capability to
configure each RAS error to trigger a firmware-first or OS-first error
interrupt".
Will this suffice or do we need to add more clarity?
Could I suggest:
"The platform should provide capability to configure RAS errors to trigger firmware-first or OS-first error interrupts."

regards
ved


atishp@...
 



On Mon, Dec 13, 2021 at 8:30 AM Anup Patel <anup@...> wrote:
On Mon, Dec 13, 2021 at 6:46 PM Ved Shanbhogue <ved@...> wrote:
>
> Hi Anup
> On Mon, Dec 13, 2021 at 09:25:43AM +0530, Anup Patel wrote:
> >>
> >> Section 2.1.4.2.4:
> >> "Software interrupts for M-mode, HS-mode and VS-mode are supported using AIA IMSIC devices". Is the term "software interrupt" here intended to be the minor identities 3, 1, and 2? If so please clarify what the support in IMSIC is expected to be for these minor identities.
> >I think the confusion here is because the RISC-V platform
> >specification uses the term "software interrupt" for both
> >"inter-processor interrupt" and "minor identities 3, 1, and 2". I
> >suggest using the term "inter-processor interrupt" in most places and
> >only using the term "software interrupt" in the context of ACLINT MSWI
> >or SSWI devices.
> >
> Yes, that was my conclusion: "software interrupt" here was used to mean an IPI. I think clearing this up would be helpful.
>
> >>
> >> Section 2.1.7.1:
> >> Is supporting SBI TIME optional if the server extension is supported, since the server extension requires Sstc? Is supporting SBI IPI optional if the AIA IMSIC is supported?
> >
> >I agree, this text needs to be improved because Base and Server are
> >now separate platforms. Since the Server platform mandates the IMSIC
> >and the Priv Sstc extension, SBI TIME, IPI, and RFENCE can be optional
> >there, but this is not true for the Base platform.
> >
> Yes; however, the Server is additive to the Base as written. Even for the Base, if Sstc and the IMSIC are supported, then SBI TIME, IPI, and RFENCE can be optional.

The current text/organization is going to change (as discussed in
previous meetings). The Server platform will be a separate platform
independent of the Base platform because some of the requirements will
be different for the two platforms. (@Kumar/Atish please add if I missed
anything)

For the Base platform, I agree we can make SBI TIME, IPI, and RFENCE
mandatory only when the IMSIC and Sstc are not present. (@Atish do you
recall any other rationale in this context?)


We should have a table with dependencies for SBI extensions, e.g.:

  SBI extension   Required when
  TIME            Sstc is not present
  IPI, RFENCE     neither IMSIC nor SSWI is present

I will send a patch after the spec is broken into separate platforms (OS-A Server and OS-A Embedded).
 






Kumar Sankaran
 

Thanks Ved. Minor nits below.
Would you be OK to send out a patch to the mailing list for these 3
changes and then subsequently a PR to the platform git on github? Let
me know if you need any help with this.

On Mon, Dec 13, 2021 at 11:06 AM Ved Shanbhogue <ved@...> wrote:

On Mon, Dec 13, 2021 at 10:44:41AM -0800, Kumar Sankaran wrote:
Will wording this as "Main memory must be protected with SECDED-ECC at
the minimum or a stronger/advanced method of protection" suffice?
Thanks. Yes.

The current wording is the following:
All cache structures must be protected.
Single-bit errors must be detected and corrected.
Multi-bit errors can be detected and reported.
Platforms are free to implement more advanced features than the
minimalistic requirements that are mandated here. So we should be OK.
Agree?
Could I suggest:
"Cache structures must be protected to address the Failure-in-time (FIT) requirements. The protection mechanisms may included single-bit/multi-bit error detection and/or single/multi-bit error detection/correction schemes, replaying faulting instructions, lock-step execution, etc."
Agree. I suggest we keep it high level and simply say "Cache
structures must be protected to address the Failure-in-time (FIT)
requirements. The protection mechanisms may include
single-bit/multi-bit error detection and/or single/multi-bit error
detection/correction schemes".



The current wording is "The platform should provide the capability to
configure each RAS error to trigger a firmware-first or OS-first error
interrupt".
Will this suffice or do we need to add more clarity?
Could I suggest:
"The platform should provide capability to configure RAS errors to trigger firmware-first or OS-first error interrupts."
Agree.




--
Regards
Kumar


Ved Shanbhogue
 

On Mon, Dec 13, 2021 at 11:16:49AM -0800, Kumar Sankaran wrote:
Would you be OK to send out a patch to the mailing list for these 3
changes and then subsequently a PR to the platform git on github? Let
me know if you need any help with this.
Will be glad to.

Agree. I suggest we keep it high level and simply say "Cache
structures must be protected to address the Failure-in-time (FIT)
requirements. The protection mechanisms may include
single-bit/multi-bit error detection and/or single/multi-bit error
detection/correction schemes".
Yes, that sounds good.

regards
ved


Greg Favor
 

On Mon, Dec 13, 2021 at 11:06 AM Vedvyas Shanbhogue <ved@...> wrote:
>The current wording is the following:
>All cache structures must be protected.
>Single-bit errors must be detected and corrected.
>Multi-bit errors can be detected and reported.
>Platforms are free to implement more advanced features than the
>minimalistic requirements that are mandated here. So we should be OK.
>Agree?

Could I suggest:
"Cache structures must be protected to address the Failure-in-time (FIT) requirements. The protection mechanisms may included single-bit/multi-bit error detection and/or single/multi-bit error detection/correction schemes, replaying faulting instructions, lock-step execution, etc."

This seems like a toothless and qualitative mandate since no FIT requirements are specified.  It can be a suggestion, although it's just a qualitative suggestion.  It's essentially just saying "don't forget to consider FIT requirements".  One can imagine a hundred such reminders that factor into high-end silicon design.  Why highlight just this one?

The reference to "cache structures" is also incomplete - as well as ambiguous as to whether it refers just to caches (in the most popular sense of the word) or also to other caching structures like TLBs. Most RAM-based structures in which an error can result in functional failure need to be protected. Although one can take the view that the above text was just trying to express a minimum requirement that doesn't encompass all RAM-based structures. My suggestion would be something like the following two statements:

Mandate:  At a minimum, caching structures must be protected such that single-bit errors are detected and corrected by hardware.

Recommendation:  Depending on FIT rate requirements, more advanced protection, more complete protection coverage of other structures, and/or more features may be necessary (starting with at least SECDED ECC on caching structures holding locally modified data).

Greg


Ved Shanbhogue
 

On Mon, Dec 13, 2021 at 02:00:38PM -0800, Greg Favor wrote:
On Mon, Dec 13, 2021 at 11:06 AM Vedvyas Shanbhogue <ved@...>
wrote:

The current wording is the following:
All cache structures must be protected.
Single-bit errors must be detected and corrected.
Multi-bit errors can be detected and reported.
Platforms are free to implement more advanced features than the
minimalistic requirements that are mandated here. So we should be OK.
Agree?
Could I suggest:
"Cache structures must be protected to address the Failure-in-time (FIT)
requirements. The protection mechanisms may include single-bit/multi-bit
error detection and/or single/multi-bit error detection/correction schemes,
replaying faulting instructions, lock-step execution, etc."
This seems like a toothless and qualitative mandate since no FIT
requirements are specified. It can be a suggestion, although it's just a
qualitative suggestion. It's essentially just saying "don't forget to
consider FIT requirements". One can imagine a hundred such reminders that
factor into high-end silicon design. Why highlight just this one?

The reference to "cache structures" is also incomplete - as well as
ambiguous as to whether it refers just to caches (in the most popular sense
of the word) or also to other caching structures like TLBs. Most
RAM-based structures in which an error can result in functional
failure need to be protected. Although one can take the view that the
above text was just trying to express a minimum requirement that doesn't
encompass all RAM-based structures. My suggestion would be something like
the following two statements:
Totally agree that the term "cache structure" is ambiguous and a variety of caches may be built. How caches are built should also be transparent to the ISA, software, and the platform in general. Like you said, reliability engineering is not something that affects software compatibility or hardware/software contracts. And as you rightly pointed out, caches are the most obvious, but a reliable system will need more, such as the right thermal engineering, stable clock/voltage delivery, the right ageing guardbands, use of gray codes when appropriate, voltage monitors, timing-margin sensors, protection on data/control buses, protection on register files, protection on internal data paths, etc.

I would be totally okay with dropping this whole paragraph.


Mandate: *At a minimum, caching structures must be protected such that
single-bit errors are detected and corrected by hardware.*
Would a mandate be overreaching, and why limit it to caches then?

A product may define its reliability goals and may reason that a certain cache need not be protected due to various reasons like the technology in which the product is built, the altitude at which it is supposed to be used, the architectural vulnerability factor computed for that structure, etc.

I am failing to understand how we would be adding to or removing from the OS-A platform compatibility goal, which is to be able to boot a shrink-wrapped server operating system, by trying to provide a mandate on how a platform implements reliability.

regards
ved


Greg Favor
 

On Mon, Dec 13, 2021 at 2:22 PM Ved Shanbhogue <ved@...> wrote:
>Mandate:  *At a minimum, caching structures must be protected such that
>single-bit errors are detected and corrected by hardware.*
>
Would a mandate be overreaching, and why limit it to caches then?

This was just trying to mandate a basic requirement and not go as far as requiring protection of all RAM-based structures - which some may view as overreach. Conversely, I can understand that some people can view "all caching structures" as already an overreach.

A product may define its reliability goals and may reason that a certain cache need not be protected due to various reasons like the technology in which the product is built, the altitude at which it is supposed to be used, the architectural vulnerability factor computed for that structure, etc.

I am failing to understand how we would be adding to or removing from the OS-A platform compatibility goal, which is to be able to boot a shrink-wrapped server operating system, by trying to provide a mandate on how a platform implements reliability.

I think this whole RAS-related topic in the current platform draft was to establish some form of modest RAS requirement (versus no requirement) until a proper RAS arch spec exists.  Although even then (assuming that arch spec is like x86 and ARM RAS specs that are just concerned with standardizing RAS registers for logging and the mechanisms for reporting errors), there still won't be any minimum requirement for actual error detection and correction.

Fundamentally, should the Server platform spec mandate ANY error detection/correction requirements, or just leave it as a wild west among hardware developers to individually and eventually figure out where the line exists as far as the basic needs for RAS in Server-compliant platforms?  And leave it for system integrators to discover that some Server-compliant hardware has less than "basic" RAS?

BUT if the platform spec is ONLY trying to establish hardware/software interoperability, and not also match up hardware and software expectations regarding other areas of functionality such as RAS, then that answers the question.  My own leaning is towards trying to address the latter versus the narrower view that the only concern is software interoperability.  But I understand the arguments both ways.

Greg


Ved Shanbhogue
 

On Mon, Dec 13, 2021 at 05:11:38PM -0800, Greg Favor wrote:
I think this whole RAS-related topic in the current platform draft was to
establish some form of modest RAS requirement (versus no requirement) until
a proper RAS arch spec exists. Although even then (assuming that arch spec
is like x86 and ARM RAS specs that are just concerned with standardizing
RAS registers for logging and the mechanisms for reporting errors), there
still won't be any minimum requirement for actual error detection and
correction.
I agree. I think the RAS ISA would want to be about standardized error logging and reporting, but not mandate which errors are detected/corrected and how they are corrected or contained. For example, even in the x86 and ARM space there are many product segments with varying degrees of resilience, but the RAS architecture flexibly covers the full spectrum of implementations across multiple x86 and ARM vendors.

Fundamentally, should the Server platform spec mandate ANY error
detection/correction requirements, or just leave it as a wild west among
hardware developers to individually and eventually figure out where the
line exists as far as the basic needs for RAS in *Server*-compliant
platforms? And leave it for system integrators to discover that some
Server-compliant hardware has less than "basic" RAS?
This was one of the sources of my questions. If the platform specification's intent is to specify the SEE, ISA, and non-ISA hardware - the hardware/software contract - as visible to software, so that a shrink-wrapped operating system can load, then I would say it is not the platform specification's role to teach how to design resilient hardware. If the goal of the platform specification is to teach hardware designers how to design resilient hardware, then I think the specification falls short in many ways... I think you also hit upon that in the next statement.

BUT if the platform spec is ONLY trying to establish hardware/software
interoperability, and not also match up hardware and software expectations
regarding other areas of functionality such as RAS, then that answers the
question. My own leaning is towards trying to address the latter versus
the narrower view that the only concern is software interoperability. But
I understand the arguments both ways.
My understanding was the former, i.e. establishing the standard for hardware-software interoperability. Specifically, the areas of RAS where interoperability is required - e.g. standardized logging/reporting, redirecting reporting to firmware-first, etc. - should, I think, be in the purview. Aspects like "every cache must have single-bit error correction" or "must implement SECDED-ECC" may not be necessary to achieve this objective. For example, an implementation may have two levels of caches where instructions are cached, and the lowest level may implement only parity, refetching on error from a higher-level cache or from DDR, where there might be ECC. Requiring ECC in such an implementation's instruction cache seems unnecessary - the machine is meeting its FIT rate objectives through other means.

regards
ved


Greg Favor
 

On Mon, Dec 13, 2021 at 5:38 PM Ved Shanbhogue <ved@...> wrote:
This was one of the sources of my questions. If the platform specification's intent is to specify the SEE, ISA, and non-ISA hardware - the hardware/software contract - as visible to software, so that a shrink-wrapped operating system can load, then I would say it is not the platform specification's role to teach how to design resilient hardware. If the goal of the platform specification is to teach hardware designers how to design resilient hardware, then I think the specification falls short in many ways... I think you also hit upon that in the next statement.

I wouldn't view platform mandates of this sort as teaching, but as establishing a baseline that system integrators can depend on - by guiding the hardware developers as to what that expected baseline is.  (But I get your point.)
 
My understanding was the former, i.e. establishing the standard for hardware-software interoperability. Specifically, the areas of RAS where interoperability is required - e.g. standardized logging/reporting, redirecting reporting to firmware-first, etc. - should, I think, be in the purview.

Agreed.

The fundamental question is whether the goal of the platform spec is solely to ensure hardware-software interoperability and not to go further in ensuring other minimum capabilities that compliant platforms will provide.  What should be said and not said about RAS follows from that.

Given that people are leaning towards the more limited scope or goal for the OS-A platforms, then that directly implies that there should be no requirements about what RAS features/coverage/etc. are actually implemented by compliant platforms.

Greg


Kumar Sankaran
 

On Mon, Dec 13, 2021 at 6:56 PM Greg Favor <gfavor@...> wrote:

On Mon, Dec 13, 2021 at 5:38 PM Ved Shanbhogue <ved@...> wrote:

This was one of the sources of my questions. If the platform specification's intent is to specify the SEE, ISA, and non-ISA hardware - the hardware/software contract - as visible to software, so that a shrink-wrapped operating system can load, then I would say it is not the platform specification's role to teach how to design resilient hardware. If the goal of the platform specification is to teach hardware designers how to design resilient hardware, then I think the specification falls short in many ways... I think you also hit upon that in the next statement.

I wouldn't view platform mandates of this sort as teaching, but as establishing a baseline that system integrators can depend on - by guiding the hardware developers as to what that expected baseline is. (But I get your point.)


My understanding was the former, i.e. establishing the standard for hardware-software interoperability. Specifically, the areas of RAS where interoperability is required - e.g. standardized logging/reporting, redirecting reporting to firmware-first, etc. - should, I think, be in the purview.

Agreed.

The fundamental question is whether the goal of the platform spec is solely to ensure hardware-software interoperability and not to go further in ensuring other minimum capabilities that compliant platforms will provide. What should be said and not said about RAS follows from that.

Given that people are leaning towards the more limited scope or goal for the OS-A platforms, then that directly implies that there should be no requirements about what RAS features/coverage/etc. are actually implemented by compliant platforms.

Greg
The intent of the platform spec is hardware-software interoperability.
I agree that dictating RAS hardware features is not within the scope
of the platform spec. However, we do want standards for RAS error
handling, error detection, logging/reporting, and such. For example,
using APEI to convey error information to OSPM is needed for software
interop.
So one suggestion is that we remove specific errors like single-bit
errors, multi-bit errors and such, and limit the features to error
handling, detection, and logging/reporting.

--
Regards
Kumar


Ved Shanbhogue
 

On Mon, Dec 13, 2021 at 08:47:51PM -0800, Kumar Sankaran wrote:

So one suggestion is that we remove specific errors like single-bit
errors, multi-bit errors and such, and limit the features to error
handling, detection, and logging/reporting.
So we could drop these statements:
"
- Main memory must be protected with SECDED-ECC.
- All cache structures must be protected.
- single-bit errors must be detected and corrected.
- multi-bit errors can be detected and reported.
"

And change this statement to drop the restriction to "these protected structures":
"There must be memory-mapped RAS registers to log detected errors with information about the type and location of the error"

regards
ved


Philipp Tomsich
 

Kumar & Greg,

On Tue, Dec 14, 2021 at 5:48 AM Kumar Sankaran <ksankaran@...> wrote:
On Mon, Dec 13, 2021 at 6:56 PM Greg Favor <gfavor@...> wrote:
>
> On Mon, Dec 13, 2021 at 5:38 PM Ved Shanbhogue <ved@...> wrote:
>>
>> This was one of the sources of my questions. If the platform specification's intent is to specify the SEE, ISA, and non-ISA hardware - the hardware/software contract - as visible to software, so that a shrink-wrapped operating system can load, then I would say it is not the platform specification's role to teach how to design resilient hardware. If the goal of the platform specification is to teach hardware designers how to design resilient hardware, then I think the specification falls short in many ways... I think you also hit upon that in the next statement.
>
>
> I wouldn't view platform mandates of this sort as teaching, but as establishing a baseline that system integrators can depend on - by guiding the hardware developers as to what that expected baseline is.  (But I get your point.)
>
>>
>> My understanding was the former, i.e. establishing the standard for hardware-software interoperability. Specifically, the areas of RAS where interoperability is required - e.g. standardized logging/reporting, redirecting reporting to firmware-first, etc. - should, I think, be in the purview.
>
>
> Agreed.
>
> The fundamental question is whether the goal of the platform spec is solely to ensure hardware-software interoperability and not to go further in ensuring other minimum capabilities that compliant platforms will provide.  What should be said and not said about RAS follows from that.
>
> Given that people are leaning towards the more limited scope or goal for the OS-A platforms, then that directly implies that there should be no requirements about what RAS features/coverage/etc. are actually implemented by compliant platforms.
>
> Greg

The intent of the platform spec is hardware-software interoperability.
I agree that dictating RAS hardware features is not within the scope
of the platform spec. However, we do want standards for RAS error
handling, error detection, logging/reporting, and such. For example,
using APEI to convey error information to OSPM is needed for software
interop.
So one suggestion is that we remove specific errors like single-bit
errors, multi-bit errors and such, and limit the features to error
handling, detection, and logging/reporting.

If the content is worthwhile, please consider putting it in an informative section. Content such as this might either become an (inline) application note, or go into a separate informative appendix that dives into the relationship between OS-A and RAS features.

Philipp.