
Kumar Sankaran
Hi Ved,
Are we ready to finalize the changes after all the comments and discussions on the list of questions you had on this thread? If yes, can you send a PR for these changes please? I see the PCIe INTx question is still open as per your last comment below. If you wish, we can keep the PCIe INTx question open while we resolve all the others.
Regards Kumar
On Thu, Dec 16, 2021 at 1:23 PM Vedvyas Shanbhogue <ved@...> wrote: Hi Greg -
On Tue, Dec 14, 2021 at 05:32:08PM -0800, Greg Favor wrote:
The following two items in Ved's email didn't get any response, so I offer my own below ...
On Sun, Dec 12, 2021 at 4:15 PM Vedvyas Shanbhogue <ved@...> wrote:
Section 2.3.7.3.2 - PCIe memory space: The requirement to not have any address translation for inbound accesses to any component in system address space is restrictive. If direct assignment of devices is supported, then the IOMMU would be required to do the address translation for inbound accesses. Further, for hart-originated accesses where the PCIe memory is mapped into virtual address space, there needs to be a translation through the first- and/or second-level page tables. Please help clarify why PCIe memory must not be mapped into virtual address space and why use of the IOMMU to do translation is disallowed by the specification.
I think where this came from is learnings in the ARM "server" ecosystem (as later captured in SBSA). In particular, one wants devices and software on harts to have the same view of system physical address space so that, for example, pointers can be easily passed around. Which doesn't conflict with having address translation by IOMMUs. Maybe the current text needs to be better worded, but I think the ideas to be expressed are:
For inbound PCIe transactions:
- There should be no hardware modifications of PCIe addresses outside of an IOMMU (as some vendors way back in early ARM SBSA days were wont to do).
- If there is not an IOMMU associated with the PCIe interface, then PCIe devices will have the same view of PA space as the harts.
- If there is an IOMMU associated with the PCIe interface, then system software can trust that all address modifications are under its control via hart page tables and IOMMU page tables.
For outbound PCIe transactions, system software is free to set up VA-to-PA translations in hart page tables. I think the mandate against outbound address translation was included by mistake. The key point is that there is one common view of system physical address space. Hart and IOMMU page tables may translate from hart VAs and device addresses to system physical address space, but the above ensures that "standard" system software has full control over this and doesn't have non-standard address transformations happening that it isn't aware of and doesn't know how to control.
Thanks. I think this is very clear.
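[Editor's note] To make Greg's "one common view of system physical address space" point concrete, here is a small illustrative sketch in C. The single-level tables and all names are invented stand-ins (real hart translation uses Sv39/Sv48 and real device translation uses the RISC-V IOMMU's table formats): a hart VA and a device's inbound DMA address go through separate page tables, but both resolve into the same PA space, and no other hardware rewrites addresses along the way.

#include <stdint.h>

/* Hypothetical single-level page tables, 4 KiB pages. Only meant to
 * show that the hart walk and the IOMMU walk both land in the SAME
 * system physical address space. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define NENTRIES   512

static uint64_t hart_pt[NENTRIES];   /* indexed by VA page number  */
static uint64_t iommu_pt[NENTRIES];  /* indexed by IOVA page number */

/* Translate a hart virtual address through the hart's page table. */
static uint64_t hart_translate(uint64_t va)
{
    uint64_t pa_page = hart_pt[(va >> PAGE_SHIFT) % NENTRIES];
    return (pa_page << PAGE_SHIFT) | (va & (PAGE_SIZE - 1));
}

/* Translate a device (inbound DMA) address through the IOMMU table. */
static uint64_t iommu_translate(uint64_t iova)
{
    uint64_t pa_page = iommu_pt[(iova >> PAGE_SHIFT) % NENTRIES];
    return (pa_page << PAGE_SHIFT) | (iova & (PAGE_SIZE - 1));
}

int main(void)
{
    /* System software maps hart VA page 5 and device IOVA page 9 to
     * the same physical page. Because no other hardware transforms
     * addresses, a buffer handed to the device refers to exactly the
     * bytes the hart sees. */
    hart_pt[5]  = 0x80000;
    iommu_pt[9] = 0x80000;
    return hart_translate(0x5000 + 0x10) == iommu_translate(0x9000 + 0x10)
           ? 0 : 1;
}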
Section 2.3.7.3.3 - PCIe interrupts: It seems unnecessary to require platforms built for the '22 version of the platform to support running software that is not MSI aware. Please clarify why supporting INTx emulation for legacy/pre-PCIe software compatibility is a required and not an optional capability for RISC-V platforms?
This one seems questionable to me as well, although I'm not the expert to reliably proclaim that INTx support is no longer a necessity in some server-class systems. I can imagine that back in earlier ARM "server" days this legacy issue was a bigger deal and hence was mandated in SBSA. But maybe that is no longer an issue? Or at least for 2022+ systems - to the point where mandating this legacy support is an unnecessary burden on many or the majority of such systems.
If this is a fair view going forward, then the INTx requirements should just become recommendations for systems that do feel the need to care about INTx support.
I think the recommendation could be changed to require MSI and make supporting INTx emulation optional. I am hoping to hear from BIOS and OS experts whether we would need to support OS/BIOS that are '22 platform compatible but are not MSI capable.
regards ved
-- Regards Kumar
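[Editor's note] For context on what "MSI aware" entails on the software side: an OS discovers MSI support by walking a device's PCI capability list in configuration space. Below is a minimal sketch of that walk. The offsets and capability IDs (Status bit 4 = capability list present, 0x34 = capabilities pointer, ID 0x05 = MSI, ID 0x11 = MSI-X) are from the PCI specification; the flat cfg[] array standing in for one device's config space is a made-up test harness, not a real access mechanism (a real OS would go through ECAM).

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define PCI_STATUS          0x06  /* Status register offset */
#define PCI_STATUS_CAP_LIST 0x10  /* bit 4: capability list present */
#define PCI_CAP_PTR         0x34  /* pointer to first capability */
#define PCI_CAP_ID_MSI      0x05
#define PCI_CAP_ID_MSIX     0x11

/* Stand-in for one device's 256-byte config space. */
static uint8_t cfg[256];

static uint8_t  cfg_read8(uint8_t off)  { return cfg[off]; }
static uint16_t cfg_read16(uint8_t off) { return (uint16_t)(cfg[off] | cfg[off + 1] << 8); }

/* Walk the capability list looking for MSI or MSI-X. */
static bool device_supports_msi(void)
{
    if (!(cfg_read16(PCI_STATUS) & PCI_STATUS_CAP_LIST))
        return false;                        /* no capability list at all */
    uint8_t off = cfg_read8(PCI_CAP_PTR) & 0xFC;
    for (int guard = 0; off && guard < 48; guard++) {
        uint8_t id = cfg_read8(off);         /* capability ID */
        if (id == PCI_CAP_ID_MSI || id == PCI_CAP_ID_MSIX)
            return true;
        off = cfg_read8(off + 1) & 0xFC;     /* next-capability pointer */
    }
    return false;
}

int main(void)
{
    cfg[PCI_STATUS]  = PCI_STATUS_CAP_LIST;  /* capability list present */
    cfg[PCI_CAP_PTR] = 0x40;                 /* first capability at 0x40 */
    cfg[0x40] = PCI_CAP_ID_MSI;              /* it is the MSI capability */
    cfg[0x41] = 0x00;                        /* end of list */
    printf("MSI capable: %d\n", device_supports_msi());
    return 0;
}

If every device of interest answers yes here, INTx emulation never comes into play at runtime, which is the case the proposed MSI-only requirement targets.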

Kumar Sankaran
On Tue, Dec 14, 2021 at 7:14 AM Ved Shanbhogue <ved@...> wrote: On Mon, Dec 13, 2021 at 08:47:51PM -0800, Kumar Sankaran wrote:
So one suggestion is we remove specific errors like single-bit errors, multi-bit errors and such and limit the features to error handling, detection and logging/reporting.
So we could drop these statements:
"- Main memory must be protected with SECDED-ECC.
- All cache structures must be protected.
- single-bit errors must be detected and corrected.
- multi-bit errors can be detected and reported."
And change this statement to drop the restriction to "these protected structures": "There must be memory-mapped RAS registers to log detected errors with information about the type and location of the error"
regards ved
Yes, fine by me. We can make the changes you have suggested above and leave the remaining content as is. -- Regards Kumar
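[Editor's note] As a purely hypothetical picture of the "memory-mapped RAS registers to log detected errors with information about the type and location of the error" wording that survives the trim, here is one way such an error record could be laid out. Every name and bit position below is invented for illustration (x86 MCA banks and ARM's RAS error records fill the equivalent role elsewhere); a future RISC-V RAS TG spec would define the real layout.

#include <stdint.h>

/* Hypothetical memory-mapped RAS error record, one per error source. */
struct ras_error_record {
    uint64_t status;   /* valid bit, corrected/uncorrected, error type  */
    uint64_t addr;     /* physical address associated with the error    */
    uint64_t misc;     /* source-specific detail (way, syndrome, ...)   */
    uint64_t control;  /* enables, firmware-first vs OS-first routing   */
};

/* Example status-field encodings (again, invented for illustration). */
#define RAS_STATUS_VALID        (1ULL << 63)
#define RAS_STATUS_UNCORRECTED  (1ULL << 62)
#define RAS_STATUS_ADDR_VALID   (1ULL << 61)
#define RAS_STATUS_TYPE_MASK    0xFFULL    /* e.g. cache, memory, bus  */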
Kumar & Greg,
On Tue, Dec 14, 2021 at 5:48 AM Kumar Sankaran <ksankaran@...> wrote: On Mon, Dec 13, 2021 at 6:56 PM Greg Favor <gfavor@...> wrote:
>
> On Mon, Dec 13, 2021 at 5:38 PM Ved Shanbhogue <ved@...> wrote:
>>
>> This was one of the sources of my questions. If the platform specification's intent is to specify the SEE, ISA and non-ISA hardware - the hardware/software contract - as visible to software so that a shrink-wrapped operating system can load, then I would say it's not the platform specification's role to teach how to design resilient hardware. If the goal of the platform specification is to teach hardware designers about how to design resilient hardware then I think the specification falls short in many ways...I think you also hit upon that in the next statement.
>
>
> I wouldn't view platform mandates of this sort as teaching, but as establishing a baseline that system integrators can depend on - by guiding the hardware developers as to what that expected baseline is. (But I get your point.)
>
>>
>> My understanding was the former, i.e. establishing the standard for hardware-software interoperability. Specifically, areas of RAS where interoperability is required - e.g. standardized logging/reporting, redirecting reporting to firmware-first, etc. - I think should be within the purview.
>
>
> Agreed.
>
> The fundamental question is whether the goal of the platform spec is solely to ensure hardware-software interoperability and not to go further in ensuring other minimum capabilities that compliant platforms will provide. What should be said and not said about RAS follows from that.
>
> Given that people are leaning towards the more limited scope or goal for the OS-A platforms, then that directly implies that there should be no requirements about what RAS features/coverage/etc. are actually implemented by compliant platforms.
>
> Greg
The intent of the platform spec is hardware-software interoperability.
I agree that dictating RAS hardware features is not within the scope
of the platform spec. However, we do want standards for RAS error
handling, error detection, logging/reporting and such. For example,
using APEI to convey error information to OSPM is needed for software
interop.
So one suggestion is we remove specific errors like single-bit errors,
multi-bit errors and such and limit the features to error handling,
detection and logging/reporting.
If the content is worthwhile, please consider putting it in an informative section. Content such as this might either become an (inline) application note or go into a separate informative appendix that dives into the relationship between OS-A and RAS features.
Philipp.

On Mon, Dec 13, 2021 at 05:11:38PM -0800, Greg Favor wrote:
> I think this whole RAS-related topic in the current platform draft was to establish some form of modest RAS requirement (versus no requirement) until a proper RAS arch spec exists. Although even then (assuming that arch spec is like x86 and ARM RAS specs that are just concerned with standardizing RAS registers for logging and the mechanisms for reporting errors), there still won't be any minimum requirement for actual error detection and correction.
I agree. I think the RAS ISA would want to be about standardized error logging and reporting but not mandate what errors are detected/corrected and how they are corrected or contained. For example, even in x86 and ARM space there are many product segments which have varying degrees of resilience, but the RAS architecture flexibly covers the full spectrum of implementations between multiple x86 and ARM vendors.
> Fundamentally, should the Server platform spec mandate ANY error detection/correction requirements, or just leave it as a wild west among hardware developers to individually and eventually figure out where the line exists as far as the basic needs for RAS in *Server*-compliant platforms? And leave it for system integrators to discover that some Server-compliant hardware has less than "basic" RAS?
This was one of the sources of my questions. If the platform specification's intent is to specify the SEE, ISA and non-ISA hardware - the hardware/software contract - as visible to software so that a shrink-wrapped operating system can load, then I would say it's not the platform specification's role to teach how to design resilient hardware. If the goal of the platform specification is to teach hardware designers how to design resilient hardware, then I think the specification falls short in many ways...I think you also hit upon that in the next statement.
> BUT if the platform spec is ONLY trying to establish hardware/software interoperability, and not also match up hardware and software expectations regarding other areas of functionality such as RAS, then that answers the question. My own leaning is towards trying to address the latter versus the narrower view that the only concern is software interoperability. But I understand the arguments both ways.
My understanding was the former, i.e. establishing the standard for hardware-software interoperability. Specifically, areas of RAS where interoperability is required - e.g. standardized logging/reporting, redirecting reporting to firmware-first, etc. - I think should be within the purview. Aspects like "every cache must have single bit error correction" or "must implement SECDED-ECC" may not be necessary to achieve this objective. For example, an implementation may have two levels of caches where instructions may be cached, and for the lowest level the implementation may only implement parity, but on an error refetch from a higher-level cache or DDR, where there might be ECC. So requiring ECC in such an implementation's instruction cache seems unnecessary - the machine is meeting its FIT rate objectives through other means. regards ved
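[Editor's note] For readers following the FIT-rate argument: FIT counts failures per 10^9 device-hours, and a structure's contribution scales with its raw upset rate, its size, its architectural vulnerability factor (AVF), and the fraction of errors its protection fails to remove. The sketch below, with all numbers invented for illustration, shows why a parity-plus-refetch instruction cache, as in Ved's example, can meet a silent-data-corruption (SDC) budget without ECC.

#include <stdio.h>

/* FIT = failures per 1e9 device-hours. All inputs are hypothetical
 * example numbers, not data from any real process or product. */
static double struct_fit(double raw_fit_per_mbit, double size_mbit,
                         double avf, double residual_after_protection)
{
    return raw_fit_per_mbit * size_mbit * avf * residual_after_protection;
}

int main(void)
{
    /* I-cache: parity detects single-bit flips, and a detected error
     * is repaired by refetching clean data from L2/DRAM, so nearly
     * all upsets are removed from the SDC budget. */
    double icache_sdc  = struct_fit(1000.0, 0.25, 0.3, 0.01);

    /* Same cache with no protection: every vulnerable upset lands in
     * the SDC budget. */
    double icache_bare = struct_fit(1000.0, 0.25, 0.3, 1.0);

    printf("I-cache SDC FIT with parity+refetch: %.2f\n", icache_sdc);
    printf("I-cache SDC FIT unprotected:         %.2f\n", icache_bare);
    return 0;
}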
On Mon, Dec 13, 2021 at 2:22 PM Ved Shanbhogue <ved@...> wrote:
> >Mandate: *At a minimum, caching structures must be protected such that
> >single-bit errors are detected and corrected by hardware.*
> >
> Would a mandate be overreaching, and why limit it to caches then?
This was just trying to mandate a basic requirement and not go as far as requiring protection of all RAM-based structures - which some may view as overreach. Conversely I can understand that some people can view that "all caching structures" is already an overreach.
> A product may define its reliability goals and may reason that a certain cache need not be protected due to various reasons like the technology in which the product is built, the altitude at which it is supposed to be used, the architectural vulnerability factor computed for that structure, etc.
> I am failing to understand how we would be adding to or removing from the OS-A platform compatibility goals - which is to be able to boot a shrink-wrapped server operating system - by trying to provide a mandate on how it implements reliability?
I think this whole RAS-related topic in the current platform draft was to establish some form of modest RAS requirement (versus no requirement) until a proper RAS arch spec exists. Although even then (assuming that arch spec is like x86 and ARM RAS specs that are just concerned with standardizing RAS registers for logging and the mechanisms for reporting errors), there still won't be any minimum requirement for actual error detection and correction.
Fundamentally, should the Server platform spec mandate ANY error detection/correction requirements, or just leave it as a wild west among hardware developers to individually and eventually figure out where the line exists as far as the basic needs for RAS in Server-compliant platforms? And leave it for system integrators to discover that some Server-compliant hardware has less than "basic" RAS?
BUT if the platform spec is ONLY trying to establish hardware/software interoperability, and not also match up hardware and software expectations regarding other areas of functionality such as RAS, then that answers the question. My own leaning is towards trying to address the latter versus the narrower view that the only concern is software interoperability. But I understand the arguments both ways.
Greg
On Mon, Dec 13, 2021 at 02:00:38PM -0800, Greg Favor wrote:
> On Mon, Dec 13, 2021 at 11:06 AM Vedvyas Shanbhogue <ved@...> wrote:
> >The current wording is the following. All cache structures must be protected. single-bit errors must be detected and corrected. multi-bit errors can be detected and reported. Platforms are free to implement more advanced features than the minimalistic requirements that are mandated here. So we should be OK. Agree?
> Could I suggest: "Cache structures must be protected to address the Failure-in-time (FIT) requirements. The protection mechanisms may include single-bit/multi-bit error detection and/or single/multi-bit error detection/correction schemes, replaying faulting instructions, lock-step execution, etc."
> This seems like a toothless and qualitative mandate since no FIT requirements are specified. It can be a suggestion, although it's just a qualitative suggestion. It's essentially just saying "don't forget to consider FIT requirements". One can imagine a hundred such reminders that factor into high-end silicon design. Why highlight just this one?
> The reference to "cache structures" is also incomplete - as well as ambiguous as to whether it refers just to caches (in the most popular sense of the word) or also to other caching structures like TLBs as well. Most all RAM-based structures in which an error can result in functional failure need to be protected. Although one can take the view that the above text was just trying to express a minimum requirement that doesn't encompass all RAM-based structures. My suggestion would be something like the following two statements:
Totally agree that the term "cache structure" is ambiguous and a variety of caches may be built. How caches are built should also be transparent to the ISA, software, and the platform in general. Like you said, reliability engineering is not something that affects software compatibility or hardware/software contracts. And as you rightly pointed out, caches are the most obvious, but a reliable system will need more, such as right thermal engineering, stable clock/voltage delivery, right ageing guardbands, use of gray codes when appropriate, voltage monitors, timing margin sensors, protection on data/control buses, protection on register files, protection on internal data paths, etc. I would be totally okay with saying drop this whole paragraph.
> Mandate: *At a minimum, caching structures must be protected such that single-bit errors are detected and corrected by hardware.*
Would a mandate be overreaching, and why limit it to caches then? A product may define its reliability goals and may reason that a certain cache need not be protected due to various reasons like the technology in which the product is built, the altitude at which it is supposed to be used, the architectural vulnerability factor computed for that structure, etc. I am failing to understand how we would be adding to or removing from the OS-A platform compatibility goals - which is to be able to boot a shrink-wrapped server operating system - by trying to provide a mandate on how it implements reliability? regards ved
On Mon, Dec 13, 2021 at 11:06 AM Vedvyas Shanbhogue <ved@...> wrote:
>The current wording is the following.
>All cache structures must be protected.
>single-bit errors must be detected and corrected.
>multi-bit errors can be detected and reported.
>Platforms are free to implement more advanced features than the
>minimalistic requirements that are mandated here. So we should be OK.
>Agree?
>Could I suggest:
>"Cache structures must be protected to address the Failure-in-time (FIT) requirements. The protection mechanisms may include single-bit/multi-bit error detection and/or single/multi-bit error detection/correction schemes, replaying faulting instructions, lock-step execution, etc."
This seems like a toothless and qualitative mandate since no FIT requirements are specified. It can be a suggestion, although it's just a qualitative suggestion. It's essentially just saying "don't forget to consider FIT requirements". One can imagine a hundred such reminders that factor into high-end silicon design. Why highlight just this one?
The reference to "cache structures" is also incomplete - as well as ambiguous as to whether it refers just to caches (in the most popular sense of the word) or also to other caching structures like TLBs as well. Most all RAM-based structures in which an error can result in functional failure need to be protected. Although one can take the view that the above text was just trying to express a minimum requirement that doesn't encompass all RAM-based structures. My suggestion would be something like the following two statements:
Mandate: At a minimum, caching structures must be protected such that single-bit errors are detected and corrected by hardware.
Recommendation: Depending on FIT rate requirements, more advanced protection, more complete protection coverage of other structures, and/or more features may be necessary (starting with at least SECDED ECC on caching structures holding locally modified data).
Greg
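[Editor's note] For readers less familiar with the SECDED construction referenced throughout this thread, here is a minimal self-contained sketch on 4 data bits: Hamming(7,4) plus an overall parity bit, i.e. an (8,4) code. A single flipped bit is located by the syndrome and corrected; two flipped bits give a nonzero syndrome with even overall parity, detectable but not correctable. Real memories use the same construction at (72,64) or wider; the bit placement below is just the textbook layout, and __builtin_parity assumes GCC/Clang.

#include <stdint.h>
#include <stdio.h>

/* (8,4) SECDED: Hamming(7,4) plus overall parity. Codeword bit i is
 * Hamming position i (1..7); bit 0 holds overall parity over 1..7. */
static uint8_t encode(uint8_t d)   /* d: 4 data bits d0..d3 */
{
    uint8_t d0 = d & 1, d1 = (d >> 1) & 1, d2 = (d >> 2) & 1, d3 = (d >> 3) & 1;
    uint8_t c = 0;
    c |= (uint8_t)((d0 ^ d1 ^ d3) << 1);   /* p1 covers positions 3,5,7 */
    c |= (uint8_t)((d0 ^ d2 ^ d3) << 2);   /* p2 covers positions 3,6,7 */
    c |= (uint8_t)(d0 << 3);
    c |= (uint8_t)((d1 ^ d2 ^ d3) << 4);   /* p4 covers positions 5,6,7 */
    c |= (uint8_t)(d1 << 5);
    c |= (uint8_t)(d2 << 6);
    c |= (uint8_t)(d3 << 7);
    return c | (uint8_t)__builtin_parity(c);   /* bit 0: overall parity */
}

/* Returns 0 = clean or corrected (result in *c, data in *data_out),
 *         1 = uncorrectable double-bit error detected (DUE). */
static int decode(uint8_t *c, uint8_t *data_out)
{
    uint8_t w = *c, s = 0;
    for (int pos = 1; pos <= 7; pos++)      /* syndrome = XOR of set positions */
        if ((w >> pos) & 1)
            s ^= (uint8_t)pos;
    uint8_t q = (uint8_t)__builtin_parity(w);  /* 0 iff overall parity holds */

    if (s != 0 && q == 0)
        return 1;                           /* two bits flipped: detect only */
    if (s != 0)
        w ^= (uint8_t)(1 << s);             /* one bit flipped: correct it   */
    else if (q != 0)
        w ^= 1;                             /* overall parity bit flipped    */
    *c = w;
    *data_out = (uint8_t)(((w >> 3) & 1) | (((w >> 5) & 1) << 1) |
                          (((w >> 6) & 1) << 2) | (((w >> 7) & 1) << 3));
    return 0;
}

int main(void)
{
    uint8_t d, c = encode(0xA);
    c ^= (1 << 6);                          /* inject a single-bit error */
    printf("single-bit: %s, data=0x%X\n", decode(&c, &d) ? "DUE" : "corrected", d);
    c ^= (1 << 2) | (1 << 5);               /* inject a double-bit error */
    printf("double-bit: %s\n", decode(&c, &d) ? "detected (DUE)" : "corrected");
    return 0;
}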
On Mon, Dec 13, 2021 at 11:16:49AM -0800, Kumar Sankaran wrote:
> Would you be OK to send out a patch to the mailing list for these 3 changes and then subsequently a PR to the platform git on github? Let me know if you need any help with this.
Will be glad to.
> Agree. I suggest we keep it high level and simply say "Cache structures must be protected to address the Failure-in-time (FIT) requirements. The protection mechanisms may include single-bit/multi-bit error detection and/or single/multi-bit error detection/correction schemes".
Yes, that sounds good. regards ved

Kumar Sankaran
Thanks Ved. Minor nits below. Would you be OK to send out a patch to the mailing list for these 3 changes and then subsequently a PR to the platform git on github? Let me know if you need any help with this.
On Mon, Dec 13, 2021 at 11:06 AM Ved Shanbhogue <ved@...> wrote:
> On Mon, Dec 13, 2021 at 10:44:41AM -0800, Kumar Sankaran wrote:
> >Will wording this as "Main memory must be protected with SECDED-ECC at the minimum or a stronger/advanced method of protection" suffice?
> Thanks. Yes.
> >The current wording is the following. All cache structures must be protected. single-bit errors must be detected and corrected. multi-bit errors can be detected and reported. Platforms are free to implement more advanced features than the minimalistic requirements that are mandated here. So we should be OK. Agree?
> Could I suggest: "Cache structures must be protected to address the Failure-in-time (FIT) requirements. The protection mechanisms may include single-bit/multi-bit error detection and/or single/multi-bit error detection/correction schemes, replaying faulting instructions, lock-step execution, etc."
Agree. I suggest we keep it high level and simply say "Cache structures must be protected to address the Failure-in-time (FIT) requirements. The protection mechanisms may include single-bit/multi-bit error detection and/or single/multi-bit error detection/correction schemes".
> >The current wording is "The platform should provide the capability to configure each RAS error to trigger firmware-first or OS-first error interrupt". Will this suffice or do we need to add more clarity?
> Could I suggest: "The platform should provide capability to configure RAS errors to trigger firmware-first or OS-first error interrupts."
Agree.
> regards ved
-- Regards Kumar
On Mon, Dec 13, 2021 at 8:30 AM Anup Patel <anup@...> wrote: On Mon, Dec 13, 2021 at 6:46 PM Ved Shanbhogue <ved@...> wrote:
>
> Hi Anup
> On Mon, Dec 13, 2021 at 09:25:43AM +0530, Anup Patel wrote:
> >>
> >> Section 2.1.4.2.4:
> >> "Software interrupts for M-mode, HS-mode and VS-mode are supported using AIA IMSIC devices". Is the term "software interrupt" here intended to be the minor identities 3, 1, and 2? If so please clarify what the support in IMSIC is expected to be for these minor identities.
> >I think the confusion here is because the RISC-V platform
> >specification uses the term "software interrupt" for both
> >"inter-processor interrupt" and "minor identities 3, 1, and 2". I
> >suggest using the term "inter-processor interrupt" at most places and
> >only use the term "software interrupt" in-context of ACLINT MSWI or
> >SSWI devices.
> >
> Yes, that was my conclusion that "software interrupt" here was used to mean an IPI. I think clearing this up would be helpful.
>
> >>
> >> Section 2.1.7.1:
> >> Is supporting SBI TIME optional if the server extension is supported as the server extension requires Sstc? Is supporting SBI IPI optional if AIA IMSIC is supported?
> >
> >I agree, this text needs to be improved because now Base and Server
> >are separate platforms. Since, the Server platform mandates IMSIC and
> >Priv Sstc extension so SBI TIME, IPI and RFENCE can be optional but
> >this is not true for the Base platform.
> >
> Yes, however the Server is additive to the base as written. Even for base, if Sstc and IMSIC are supported then SBI TIME, IPI, and RFENCE can be optional.
The current text/organization is going to change (as discussed in
previous meetings). The Server platform will be a separate platform
independent of the Base platform because some of the requirements will
be different for both platforms. (@Kumar/Atish please add if I missed
anything)
For the Base platform, I agree we can make SBI TIME, IPI and RFENCE
mandatory only when IMSIC and Sstc is not present. (@Atish do you
recall any other rationale in this context ?)
We should have a table with dependencies for SBI extensions, e.g.:
- SBI TIME only required if Sstc is not present
- SBI IPI/RFENCE only required if IMSIC or SSWI is not present
I will send a patch after the spec is broken into separate platforms (OS-A server and OS-A embedded)
Regards,
Anup
>
> regards
> ved
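[Editor's note] On the "SBI TIME/IPI/RFENCE required only when Sstc/IMSIC are absent" point above: an OS can settle this at boot with the Base extension's probe call. The sketch below uses the SBI calling convention (EID in a7, FID in a6, error/value returned in a0/a1; Base EID 0x10, probe_extension FID 3) and the ratified TIME/IPI/RFENCE extension IDs. The timer_and_ipi_ok() policy wrapper and its has_sstc/has_imsic inputs are hypothetical, standing in for what a kernel would learn from DT/ACPI; RV64 S-mode with an SBI implementation present is assumed.

#include <stdint.h>

#define SBI_EXT_BASE        0x10
#define SBI_BASE_PROBE_EXT  3
#define SBI_EXT_TIME        0x54494D45  /* "TIME" */
#define SBI_EXT_IPI         0x00735049  /* "sPI"  */
#define SBI_EXT_RFENCE      0x52464E43  /* "RFNC" */

/* Minimal RV64 ecall wrapper: EID in a7, FID in a6, arg in a0;
 * error comes back in a0, value in a1. */
static long sbi_probe_extension(long eid)
{
    register long a0 asm("a0") = eid;
    register long a6 asm("a6") = SBI_BASE_PROBE_EXT;
    register long a7 asm("a7") = SBI_EXT_BASE;
    register long a1 asm("a1");
    asm volatile("ecall"
                 : "+r"(a0), "=r"(a1)
                 : "r"(a6), "r"(a7)
                 : "memory");
    return a1;   /* nonzero if the extension is available */
}

/* Boot-time policy matching the dependency table discussed above:
 * fall back to SBI TIME only when the Sstc CSRs are absent, and to
 * SBI IPI/RFENCE only when no IMSIC/SSWI is available. */
int timer_and_ipi_ok(int has_sstc, int has_imsic)
{
    int timer_ok = has_sstc  || sbi_probe_extension(SBI_EXT_TIME);
    int ipi_ok   = has_imsic || (sbi_probe_extension(SBI_EXT_IPI) &&
                                 sbi_probe_extension(SBI_EXT_RFENCE));
    return timer_ok && ipi_ok;
}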

Kumar Sankaran
On Sun, Dec 12, 2021 at 4:15 PM Vedvyas Shanbhogue <ved@...> wrote: Greetings All!
Please help with the following questions about the OS-A/server extension:
Section 2.1.4.1 - Timer support: Should the ACLINT MTIMER support be optional, or moved into the M-platform section, or made optional for the server extension?
Section 2.1.4.2.4: "Software interrupts for M-mode, HS-mode and VS-mode are supported using AIA IMSIC devices". Is the term "software interrupt" here intended to be the minor identities 3, 1, and 2? If so please clarify what the support in IMSIC is expected to be for these minor identities.
Section 2.1.7.1: Is supporting SBI TIME optional if the server extension is supported as the server extension requires Sstc? Is supporting SBI IPI optional if AIA IMSIC is supported?
Section 2.3.7.3.2 - PCIe memory space: The requirement to not have any address translation for inbound accesses to any component in system address space is restrictive. If direct assignment of devices is supported, then the IOMMU would be required to do the address translation for inbound accesses. Further, for hart-originated accesses where the PCIe memory is mapped into virtual address space, there needs to be a translation through the first- and/or second-level page tables. Please help clarify why PCIe memory must not be mapped into virtual address space and why use of the IOMMU to do translation is disallowed by the specification.
Section 2.3.7.3.3 - PCIe interrupts: It seems unnecessary to require platforms built for the '22 version of the platform to support running software that is not MSI aware. Please clarify why supporting INTx emulation for legacy/pre-PCIe software compatibility is a required and not an optional capability for RISC-V platforms?
Section 2.3.9: For many platforms, especially server class platforms, SECDED-ECC is not sufficient protection for main memory. Why is the specification restricting a platform to only support SECDED-ECC? Is it a violation of the platform specification if the platform supports stronger or more advanced ECC than SECDED-ECC?
Agree. The intent here was to mandate a minimal set of memory protection features for server class platforms. It is not a violation of the platform spec to have something better. As per the discussions within the group, the RAS specification is something that needs to be taken up by a TG within RISC-V and driven to completion. At this point in time, we don't have a RAS spec, while at the same time, we didn't want to leave this topic completely off of the platform spec, especially for servers. Will wording this as "Main memory must be protected with SECDED-ECC at the minimum or a stronger/advanced method of protection" suffice?
Which caches are/should be protected, and how, is usually a function of the technology used by the cache, the operating conditions, and the SDC/DUE goals of the platform, and is established based on FIT modeling for that platform. Please clarify the rationale for requiring single-bit error correction on all caches. Also please clarify why the spec doesn't allow for correcting multi-bit errors; based on the SDC goals some caches may need to support, e.g., triple-error detection and double-error correction.
The rationale was to have a minimal set of RAS requirements until we have a proper RISC-V RAS spec that we can refer to. Hence, having single-bit error correction was a minimalistic requirement. It is not a platform spec violation to have the ability to correct multi-bit errors. The current wording is the following. All cache structures must be protected. single-bit errors must be detected and corrected. multi-bit errors can be detected and reported. Platforms are free to implement more advanced features than the minimalistic requirements that are mandated here. So we should be OK. Agree?
The firmware-first model seems to require a configuration per RAS-event/RAS-error-source to trigger firmware-first or OS-first. There may be hundreds of RAS events/errors that a platform supports. Why is it required to support per-event/per-error selectivity vs. a two-level selection where all RAS errors are either handled firmware-first or handled OS-first?
Yes, there may be hundreds of RAS errors/events. The intent here was that at the lowest level of granularity, we should be able to selectively route each of these to the respective software/firmware entity. So yes, we could add additional gates on top, like the two-level selection you have suggested, but the platform spec is simply conveying the expected support at the lowest level. The current wording is "The platform should provide the capability to configure each RAS error to trigger firmware-first or OS-first error interrupt". Will this suffice or do we need to add more clarity? regards ved
-- Regards Kumar
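[Editor's note] A hedged sketch of the per-error-source routing Kumar describes, plus the coarse two-level override Ved raised. Every name and bit position here is invented purely for illustration, since no RISC-V RAS spec defines these yet.

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical per-error-source routing control; one control word per
 * RAS error source, of which a platform may have hundreds. */
#define RAS_CTRL_ENABLE    (1u << 0)
#define RAS_CTRL_FW_FIRST  (1u << 1)  /* set: firmware-first; clear: OS-first */

/* Optional global override implementing the coarse two-level scheme:
 * if set, every source routes firmware-first regardless of its bit. */
static bool global_fw_first;

static bool route_to_firmware(uint32_t per_source_ctrl)
{
    if (!(per_source_ctrl & RAS_CTRL_ENABLE))
        return false;                 /* reporting disabled for this source */
    return global_fw_first || (per_source_ctrl & RAS_CTRL_FW_FIRST);
}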

Kumar Sankaran
On Mon, Dec 13, 2021 at 8:30 AM Anup Patel <anup@...> wrote: On Mon, Dec 13, 2021 at 6:46 PM Ved Shanbhogue <ved@...> wrote:
Hi Anup On Mon, Dec 13, 2021 at 09:25:43AM +0530, Anup Patel wrote:
Section 2.1.4.2.4: "Software interrupts for M-mode, HS-mode and VS-mode are supported using AIA IMSIC devices". Is the term "software interrupt" here intended to be the minor identities 3, 1, and 2? If so please clarify what the support in IMSIC is expected to be for these minor identities. I think the confusion here is because the RISC-V platform specification uses the term "software interrupt" for both "inter-processor interrupt" and "minor identities 3, 1, and 2". I suggest using the term "inter-processor interrupt" at most places and only use the term "software interrupt" in-context of ACLINT MSWI or SSWI devices.
Yes, that was my conclusion that "software interrupt" here was used to mean an IPI. I think clearing this up would be helpful.
Section 2.1.7.1: Is supporting SBI TIME optional if the server extension is supported as the server extension requires Sstc? Is supporting SBI IPI optional if AIA IMSIC is supported? I agree, this text needs to be improved because now Base and Server are separate platforms. Since the Server platform mandates IMSIC and the Priv Sstc extension, SBI TIME, IPI and RFENCE can be optional, but this is not true for the Base platform.
Yes, however the Server is additive to the base as written. Even for base, if Sstc and IMSIC are supported then SBI TIME, IPI, and RFENCE can be optional. The current text/organization is going to change (as discussed in previous meetings). The Server platform will be a separate platform independent of the Base platform because some of the requirements will be different for both platforms. (@Kumar/Atish please add if I missed anything)
Yes, as per our agreement during the Platform HSC meeting several weeks back, the plan is to make the OS-A Embedded and OS-A Server as individual platforms without any relationship to each other. Common requirements between OS-A Embedded and OS-A Server will be put into a new section called OS-A Common Requirements. This way, we can have separate requirements for each platform independent of the other. So the OS-A Server will NOT be an extension of OS-A Embedded anymore but a separate platform. For the Base platform, I agree we can make SBI TIME, IPI and RFENCE mandatory only when IMSIC and Sstc is not present. (@Atish do you recall any other rationale in this context ?)
Regards, Anup
regards ved
-- Regards Kumar

Anup Patel
On Mon, Dec 13, 2021 at 6:46 PM Ved Shanbhogue <ved@...> wrote: Hi Anup On Mon, Dec 13, 2021 at 09:25:43AM +0530, Anup Patel wrote:
Section 2.1.4.2.4: "Software interrupts for M-mode, HS-mode and VS-mode are supported using AIA IMSIC devices". Is the term "software interrupt" here intended to be the minor identities 3, 1, and 2? If so please clarify what the support in IMSIC is expected to be for these minor identities. I think the confusion here is because the RISC-V platform specification uses the term "software interrupt" for both "inter-processor interrupt" and "minor identities 3, 1, and 2". I suggest using the term "inter-processor interrupt" at most places and only use the term "software interrupt" in-context of ACLINT MSWI or SSWI devices.
Yes, that was my conclusion that "software interrupt" here was used to mean an IPI. I think clearing this up would be helpful.
Section 2.1.7.1: Is supporting SBI TIME optional if the server extension is supported as the server extension requires Sstc? Is supporting SBI IPI optional if AIA IMSIC is supported? I agree, this text needs to be improved because now Base and Server are separate platforms. Since the Server platform mandates IMSIC and the Priv Sstc extension, SBI TIME, IPI and RFENCE can be optional, but this is not true for the Base platform.
Yes, however the Server is additive to the base as written. Even for base, if Sstc and IMSIC are supported then SBI TIME, IPI, and RFENCE can be optional.
The current text/organization is going to change (as discussed in previous meetings). The Server platform will be a separate platform independent of the Base platform because some of the requirements will be different for both platforms. (@Kumar/Atish please add if I missed anything) For the Base platform, I agree we can make SBI TIME, IPI and RFENCE mandatory only when IMSIC and Sstc is not present. (@Atish do you recall any other rationale in this context ?) Regards, Anup regards ved