Platform specification questions
Ved Shanbhogue
Greetings All!
Please help with the following questions about the OS-A/server extension:

Section 2.1.4.1 - Timer support: Should ACLINT MTIMER support be optional, be moved into the M-platform section, or be made optional for the server extension?

Section 2.1.4.2.4: "Software interrupts for M-mode, HS-mode and VS-mode are supported using AIA IMSIC devices". Is the term "software interrupt" here intended to mean the minor identities 3, 1, and 2? If so, please clarify what support the IMSIC is expected to provide for these minor identities.

Section 2.1.7.1: Is supporting SBI TIME optional if the server extension is supported, given that the server extension requires Sstc? Is supporting SBI IPI optional if AIA IMSIC is supported?

Section 2.3.7.3.2 - PCIe memory space: The requirement to not have any address translation for inbound accesses to any component in system address space is restrictive. If direct assignment of devices is supported, then the IOMMU would be required to do the address translation for inbound accesses. Further, for hart-originated accesses where the PCIe memory is mapped into virtual address space, there needs to be a translation through the first and/or second level page tables. Please help clarify why PCIe memory must not be mapped into virtual address space and why use of an IOMMU to do translation is disallowed by the specification.

Section 2.3.7.3.3 - PCIe interrupts: It seems unnecessary to require platforms built for the '22 version of the platform to support running software that is not MSI-aware. Please clarify why supporting INTx emulation for legacy/pre-PCIe software compatibility is a required and not an optional capability for RISC-V platforms.

Section 2.3.9: For many platforms, especially server-class platforms, SECDED-ECC is not sufficient protection for main memory. Why is the platform specification restricting a platform to only support SECDED-ECC? Is it a violation of the platform specification if the platform supports stronger or more advanced ECC than SECDED-ECC?
Which caches are/should be protected, and how, is usually a function of the technology used by the cache, the operating conditions, and the SDC/DUE goals of the platform, and is established based on FIT modeling for that platform. Please clarify the rationale for requiring single-bit error correction on all caches. Also, please clarify why the spec doesn't allow for correcting multi-bit errors; based on the SDC goals, some caches may need to support, e.g., triple error detection and double error correction.

The firmware-first model seems to require a configuration per RAS event/RAS error source to trigger firmware-first or OS-first. There may be hundreds of RAS events/errors that a platform supports. Why is it required to support per-event/per-error selectivity vs. a two-level selection where all RAS errors are either handled firmware-first or handled OS-first?

regards
ved
|
Hi Ved,
Please see comments inline below ...

Regards,
Anup

On Mon, Dec 13, 2021 at 5:45 AM Vedvyas Shanbhogue <ved@...> wrote:

The RISC-V Privileged v1.12 specification defines MTIME and MTIMECMP as platform-specific memory-mapped registers in "Section 3.2 Machine-Level Memory-Mapped Registers". This means the RISC-V platform specification needs to standardize the memory layout and arrangement of the MTIME and MTIMECMP memory-mapped registers, which is what the ACLINT MTIMER specification does. (Refer to https://github.com/riscv/riscv-isa-manual/releases/download/Priv-v1.12/riscv-privileged-20211203.pdf)

Since both OS-A platforms with M-mode and M platforms need ACLINT MTIMER, I suggest that OS-A platforms should say "If M-mode is implemented then ACLINT MTIMER should be supported ...".

The AIA IMSIC devices do not provide interrupts via minor identities 3, 1, and 2. Both AIA IMSIC and APLIC devices only deal with minor identities 9, 10, and 11 (i.e. external interrupts) whereas the ACLINT specification defines devices that deal with minor identities 1, 3, and 7.

For software, the "inter-processor interrupts" (at M-mode or S-mode) can be implemented:
1) Using ACLINT MSWI or SSWI devices
OR
2) Using AIA IMSIC devices

I think the confusion here is because the RISC-V platform specification uses the term "software interrupt" for both "inter-processor interrupt" and "minor identities 3, 1, and 2". I suggest using the term "inter-processor interrupt" in most places and only using the term "software interrupt" in the context of ACLINT MSWI or SSWI devices.

I agree, this text needs to be improved because now Base and Server are separate platforms. Since the Server platform mandates IMSIC and the Priv Sstc extension, SBI TIME, IPI and RFENCE can be optional, but this is not true for the Base platform.
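To make the split described above concrete, here is a minimal Python sketch of which device handles which minor identity. The identity numbers are the ones given in the message; assigning 1 to SSWI, 3 to MSWI and 7 to MTIMER follows from the standard supervisor-software, machine-software and machine-timer interrupt numbers. The table and function names are hypothetical and exist only for illustration.

```python
# Minor identities handled by each device, as described in the message above.
# The dictionary and the helper below are illustrative names, not spec text.
DEVICE_IDENTITIES = {
    "ACLINT SSWI": (1,),        # supervisor software interrupt
    "ACLINT MSWI": (3,),        # machine software interrupt
    "ACLINT MTIMER": (7,),      # machine timer interrupt
    "AIA IMSIC": (9, 10, 11),   # external interrupts (S, VS, M)
    "AIA APLIC": (9, 10, 11),   # external interrupts (S, VS, M)
}


def devices_for(identity: int) -> list[str]:
    """Return the devices that deal with the given minor identity."""
    return sorted(
        name for name, ids in DEVICE_IDENTITIES.items() if identity in ids
    )


# Minor identity 2 (VS-level software interrupt) is not handled by any of
# these devices, which is why "software interrupt" in the draft reads as IPI.
print(devices_for(2))
print(devices_for(9))
```

Seen this way, an IPI delivered through an IMSIC arrives as an external interrupt (9, 10 or 11) and not as one of the software-interrupt identities.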
|
|
Greg Favor
On Sun, Dec 12, 2021 at 7:55 PM Anup Patel <anup@...> wrote:
On Mon, Dec 13, 2021 at 5:45 AM Vedvyas Shanbhogue <ved@...> wrote:
> Since, both OS-A platforms with M-mode and M platforms need ACLINT

Here's a response from a different angle.

MTIME matters to the SEE because it provides the timebase that is then seen by all harts in their 'time' CSRs (via the RDTIME pseudoinstruction). But if the initial OS-A platform specs are going to drop any M-mode standardization/etc., then it seems like the thing to do, from the SEE and OS-A platform perspectives, is to abstract MTIME as just the "system timebase that propagates to all harts and is seen by S/HS/U mode software in the form of the 'time' CSR" (just as the Unpriv spec does in its own words). Whatever would be said about MTIME and tick period constraints (e.g. a minimum tick period) would instead be expressed with respect to this abstracted timebase, which the Unpriv spec refers to as "wall-clock real time that has passed from an arbitrary start time in the past. .... The execution environment should provide a means of determining the period of a counter tick (seconds/tick). ...".

This separates out from the current OS-A platform specs the ACLINT MTIMER device as a standardized Machine-level implementation of the MTIME and MTIMECMP registers defined in the Priv spec.

Now, for systems that implement Priv 1.12 and the Sstc extension, and actually use the Sstc extension, this can be the end of the story. But today's systems and future systems that don't implement Sstc (unless all OS-A 2022 platform specs were to mandate Sstc support and eliminate any possibility of existing systems complying with at least the Embedded (i.e. old "Base") OS-A platform spec) also need the SBI API that provides Supervisor timer functionality to S/HS mode (with M-mode using MTIME and MTIMECMP to provide that functionality). While this is also an SEE interface, talking about this does start to sneak up on talking about MTIME. But then again, one could still abstract MTIME as the system timebase, and MTIMECMP as a timebase compare value.

Greg
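As a rough illustration of the abstraction described above, the Python sketch below treats the timebase as a 64-bit tick counter with a known tick period and a compare value. The function names and the way the tick period is obtained are assumptions made for this example; the only behavior taken from the specs is that the counter is 64 bits wide and that a timer interrupt is pending once the timebase is greater than or equal to the compare value (unsigned comparison).

```python
MASK64 = (1 << 64) - 1  # the timebase is a 64-bit counter


def elapsed_seconds(start_ticks: int, end_ticks: int,
                    seconds_per_tick: float) -> float:
    """Wall-clock time between two reads of the 'time' CSR.

    seconds_per_tick is assumed to come from the execution environment
    (for example a firmware-provided timebase frequency).
    """
    # Masking the difference keeps the result correct across a wrap-around.
    delta = (end_ticks - start_ticks) & MASK64
    return delta * seconds_per_tick


def timer_pending(timebase: int, compare: int) -> bool:
    """A timer interrupt is pending once timebase >= compare (unsigned)."""
    return (timebase & MASK64) >= (compare & MASK64)


# Four ticks at 0.5 s/tick is two seconds of wall-clock time.
print(elapsed_seconds(10, 14, 0.5))
```

Nothing in this sketch depends on where MTIME lives in the memory map, which is the point of expressing the requirement against the abstract timebase.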
|
Ved Shanbhogue
On Sun, Dec 12, 2021 at 7:55 PM Anup Patel <anup@...> wrote:
I was thinking along the lines of how Greg was thinking here.
> Since, both OS-A platforms with M-mode and M platforms need ACLINT
Agree.
> But for today's systems and for future systems that don't implement Sstc
Agree.
regards
ved
|
Ved Shanbhogue
Hi Anup
On Mon, Dec 13, 2021 at 09:25:43AM +0530, Anup Patel wrote:
Yes, that was my conclusion that "software interrupt" here was used to mean an IPI. I think clearing this up would be helpful.
> I think the confusion here is because the RISC-V platform
Yes, however the Server is additive to the base as written. Even for base, if Sstc and IMSIC are supported then SBI TIME, IPI, and RFENCE can be optional.
> I agree, this text needs to be improved because now Base and Server
regards
ved
|
On Mon, Dec 13, 2021 at 6:46 PM Ved Shanbhogue <ved@...> wrote:
The current text/organization is going to change (as discussed in previous meetings). The Server platform will be a separate platform independent of the Base platform because some of the requirements will be different for both platforms. (@Kumar/Atish please add if I missed anything)

For the Base platform, I agree we can make SBI TIME, IPI and RFENCE mandatory only when IMSIC and Sstc are not present. (@Atish do you recall any other rationale in this context?)

Regards,
Anup
|
|
On Mon, Dec 13, 2021 at 8:30 AM Anup Patel <anup@...> wrote:
Yes, as per our agreement during the Platform HSC meeting several weeks back, the plan is to make the OS-A Embedded and OS-A Server individual platforms without any relationship to each other. Common requirements between OS-A Embedded and OS-A Server will be put into a new section called OS-A Common Requirements. This way, we can have separate requirements for each platform independent of the other. So the OS-A Server will NOT be an extension of OS-A Embedded anymore but a separate platform.
> For the Base platform, I agree we can make SBI TIME, IPI and RFENCE
-- Regards Kumar
|
On Sun, Dec 12, 2021 at 4:15 PM Vedvyas Shanbhogue <ved@...> wrote:
Agree.

The intent here was to mandate a minimal set of memory protection features for server-class platforms. It is not a violation of the platform spec to have something better. As per the discussions within the group, the RAS specification is something that needs to be taken up by a TG within RISC-V and driven to completion. At this point in time, we don't have a RAS spec, while at the same time, we didn't want to leave this topic completely off of the platform spec, especially for servers. Will wording this as "Main memory must be protected with SECDED-ECC at the minimum or a stronger/advanced method of protection" suffice?

The rationale was to have a minimal set of RAS requirements until we have a proper RISC-V RAS spec that we can refer to. Hence, having single-bit error correction was a minimalistic requirement. It is not a platform spec violation to have the ability to correct multi-bit errors. The current wording is the following.
- All cache structures must be protected.
- single-bit errors must be detected and corrected.
- multi-bit errors can be detected and reported.
Platforms are free to implement more advanced features than the minimalistic requirements that are mandated here. So we should be OK. Agree?

Yes, there may be hundreds of RAS errors/events. The intent here was that at the lowest level of granularity, we should be able to selectively route each of these to the respective software/firmware entity. So yes, we could add additional gates on top, like the two-level selection you have suggested, but the platform spec is simply conveying the expected support at the lowest level. The current wording is "The platform should provide the capability to configure each RAS error to trigger firmware-first or OS-first error interrupt". Will this suffice or do we need to add more clarity?
-- Regards Kumar |
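The two routing granularities discussed above (one platform-wide choice versus a choice per RAS error) can sit on top of each other. The Python sketch below shows one possible shape under that assumption: a global default plus optional per-error overrides. All class, method and error-source names are hypothetical and are not taken from the platform spec or any RAS specification.

```python
from enum import Enum


class Route(Enum):
    FIRMWARE_FIRST = "firmware-first"
    OS_FIRST = "os-first"


class RasRouting:
    """Hypothetical routing table: global default with per-error overrides."""

    def __init__(self, default: Route) -> None:
        self.default = default                    # coarse, platform-wide choice
        self.overrides: dict[str, Route] = {}     # fine-grained, per error source

    def set_route(self, error_source: str, route: Route) -> None:
        self.overrides[error_source] = route

    def route_for(self, error_source: str) -> Route:
        # Fall back to the global default when no per-error setting exists.
        return self.overrides.get(error_source, self.default)


routing = RasRouting(default=Route.FIRMWARE_FIRST)
routing.set_route("l2_cache_corrected", Route.OS_FIRST)  # made-up source name
print(routing.route_for("l2_cache_corrected").value)
print(routing.route_for("dram_uncorrected").value)
```

A platform that only needs the coarse selection would never populate the overrides, while one with hundreds of error sources can still route individual sources differently.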
|
Ved Shanbhogue
On Mon, Dec 13, 2021 at 10:44:41AM -0800, Kumar Sankaran wrote:
> Will wording this as "Main memory must be protected with SECDED-ECC at
Thanks. Yes.
> The current wording is the following.
Could I suggest: "Cache structures must be protected to address the Failure-in-time (FIT) requirements. The protection mechanisms may include single-bit/multi-bit error detection and/or single/multi-bit error detection/correction schemes, replaying faulting instructions, lock-step execution, etc."
> The current wording is "The platform should provide the capability to
Could I suggest: "The platform should provide capability to configure RAS errors to trigger firmware-first or OS-first error interrupts."
regards
ved
|
atishp@...
On Mon, Dec 13, 2021 at 8:30 AM Anup Patel <anup@...> wrote:
On Mon, Dec 13, 2021 at 6:46 PM Ved Shanbhogue <ved@...> wrote:

We should have a table with dependencies for SBI extensions. E.g.
- SBI TIME is only required if Sstc is not present
- SBI IPI/RFENCE is only required if IMSIC or SSWI is not present

I will send a patch after the spec is broken into separate platforms (OS-A server and OS-A embedded).

Regards,
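The dependency table proposed above could be sketched in Python as follows. This is only an illustration: the function name and feature flags are hypothetical, and the second rule is read here as "IPI/RFENCE are required when neither IMSIC nor SSWI is present", which is one interpretation of the wording in the message.

```python
def required_sbi_extensions(has_sstc: bool, has_imsic: bool,
                            has_sswi: bool) -> set[str]:
    """Return the SBI extensions a platform must provide.

    Assumed reading of the proposed table:
      - TIME is required only when Sstc is absent.
      - IPI and RFENCE are required only when neither IMSIC nor SSWI exists.
    """
    required: set[str] = set()
    if not has_sstc:
        required.add("TIME")
    if not (has_imsic or has_sswi):
        required.update({"IPI", "RFENCE"})
    return required


# A platform with Sstc and IMSIC needs none of the three extensions.
print(sorted(required_sbi_extensions(True, True, False)))
# A platform with none of the features needs all three.
print(sorted(required_sbi_extensions(False, False, False)))
```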
|
Thanks Ved. Minor nits below.
Would you be OK to send out a patch to the mailing list for these 3 changes and then subsequently a PR to the platform git on github? Let me know if you need any help with this.

On Mon, Dec 13, 2021 at 11:06 AM Ved Shanbhogue <ved@...> wrote:
Agree.
I suggest we keep it high level and simply say "Cache structures must be protected to address the Failure-in-time (FIT) requirements. The protection mechanisms may include single-bit/multi-bit error detection and/or single/multi-bit error detection/correction schemes".
Agree.
> The current wording is "The platform should provide the capability to
> Could I suggest:
-- Regards Kumar |
|
Ved Shanbhogue
On Mon, Dec 13, 2021 at 11:16:49AM -0800, Kumar Sankaran wrote:
> Would you be OK to send out a patch to the mailing list for these 3
Will be glad to.
> Agree. I suggest we keep it high level and simply say "Cache
Yes, that sounds good.
regards
ved
|
Greg Favor
On Mon, Dec 13, 2021 at 11:06 AM Vedvyas Shanbhogue <ved@...> wrote:
>The current wording is the following.

This seems like a toothless and qualitative mandate since no FIT requirements are specified. It can be a suggestion, although it's just a qualitative suggestion. It's essentially just saying "don't forget to consider FIT requirements". One can imagine a hundred such reminders that factor into high-end silicon design. Why highlight just this one?

The reference to "cache structures" is also incomplete, as well as ambiguous as to whether it refers just to caches (in the most popular sense of the word) or also to other caching structures like TLBs. Almost all RAM-based structures in which an error can result in functional failure need to be protected. Although one can take the view that the above text was just trying to express a minimum requirement that doesn't encompass all RAM-based structures.

My suggestion would be something like the following two statements:

Mandate: At a minimum, caching structures must be protected such that single-bit errors are detected and corrected by hardware.

Recommendation: Depending on FIT rate requirements, more advanced protection, more complete protection coverage of other structures, and/or more features may be necessary (starting with at least SECDED ECC on caching structures holding locally modified data).

Greg
|
Ved Shanbhogue
On Mon, Dec 13, 2021 at 02:00:38PM -0800, Greg Favor wrote:
> On Mon, Dec 13, 2021 at 11:06 AM Vedvyas Shanbhogue <ved@...>
Totally agree that the term "cache structure" is ambiguous and that a variety of caches may be built. How caches are built should also be transparent to the ISA, software, and the platform in general. Like you said, reliability engineering is not something that affects software compatibility or hardware/software contracts. And as you rightly pointed out, caches are the most obvious, but a reliable system will need more, such as the right thermal engineering, stable clock/voltage delivery, the right ageing guardbands, use of gray codes when appropriate, voltage monitors, timing margin sensors, protection on data/control buses, protection on register files, protection on internal data paths, etc. I would be totally okay with dropping this whole paragraph.

> Mandate: *At a minimum, caching structures must be protected such that
Would a mandate be overreaching, and why limit it to caches then? A product may define its reliability goals and may reason that a certain cache need not be protected due to various reasons like the technology in which the product is built, the altitude at which it is supposed to be used, the architectural vulnerability factor computed for that structure, etc. I am failing to understand how we would be adding to or removing from the OS-A platform compatibility goals, which are to be able to boot a shrink-wrapped server operating system, by trying to provide a mandate on how it implements reliability.

regards
ved
|
Greg Favor
On Mon, Dec 13, 2021 at 2:22 PM Ved Shanbhogue <ved@...> wrote:
>Mandate: *At a minimum, caching structures must be protected such that

This was just trying to mandate a basic requirement and not go as far as requiring protection of all RAM-based structures, which some may view as overreach. Conversely, I can understand that some people can view "all caching structures" as already an overreach.

> A product may define its reliability goals and may reason that a certain cache need not be protected due to various reasons like the technology in which the product is built, the altitude at which it is supposed to be used, the architectural vulnerability factor computed for that structure, etc.

I think this whole RAS-related topic in the current platform draft was to establish some form of modest RAS requirement (versus no requirement) until a proper RAS arch spec exists. Although even then (assuming that arch spec is like the x86 and ARM RAS specs that are just concerned with standardizing RAS registers for logging and the mechanisms for reporting errors), there still won't be any minimum requirement for actual error detection and correction.

Fundamentally, should the Server platform spec mandate ANY error detection/correction requirements, or just leave it as a wild west among hardware developers to individually and eventually figure out where the line exists as far as the basic needs for RAS in Server-compliant platforms? And leave it for system integrators to discover that some Server-compliant hardware has less than "basic" RAS?

BUT if the platform spec is ONLY trying to establish hardware/software interoperability, and not also match up hardware and software expectations regarding other areas of functionality such as RAS, then that answers the question.

My own leaning is towards trying to address the latter versus the narrower view that the only concern is software interoperability. But I understand the arguments both ways.

Greg
|
Ved Shanbhogue
On Mon, Dec 13, 2021 at 05:11:38PM -0800, Greg Favor wrote:
> I think this whole RAS-related topic in the current platform draft was to
I agree. I think the RAS ISA would want to be about standardized error logging and reporting but not mandate what errors are detected/corrected and how they are corrected or contained. For example, even in the x86 and ARM space there are many product segments which have varying degrees of resilience, but the RAS architecture flexibly covers the full spectrum of implementations between multiple x86 and ARM vendors.

> Fundamentally, should the Server platform spec mandate ANY error
This was one of the sources of my questions. If the platform specification's intent is to specify the SEE, ISA and non-ISA hardware (the hardware/software contract) as visible to software so that a shrink-wrapped operating system can load, then I would say it's not the platform specification's role to teach how to design resilient hardware. If the goal of the platform specification is to teach hardware designers how to design resilient hardware, then I think the specification falls short in many ways... I think you also hit upon that in the next statement.

> BUT if the platform spec is ONLY trying to establish hardware/software
My understanding was the former, i.e. establishing the standard for hardware-software interoperability. Specifically in areas of RAS, I think the places where interoperability is required (e.g. standardized logging/reporting, redirecting reporting to firmware-first, etc.) should be in the purview. Aspects like "every cache must have single bit error correction" or "must implement SECDED-ECC" may not be necessary to achieve this objective. For example, an implementation may have two levels of caches where instructions may be cached, and for the lowest level the implementation may only implement parity but, on an error, refetch from a higher-level cache or DDR where there might be ECC. So for such an implementation, requiring ECC in its instruction cache seems unnecessary; the machine is meeting its FIT rate objectives through other means.

regards
ved
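The parity-plus-refetch scheme mentioned in the example above can be sketched in a few lines of Python. This is a toy software model with hypothetical names, not a description of any real cache design: the backing dictionary stands in for the higher-level cache or DDR that is assumed to hold a good (for example ECC-protected) copy, and a single parity bit only detects an odd number of flipped bits.

```python
def parity(word: int) -> int:
    """Even-parity bit of a word: 1 if the number of set bits is odd."""
    return bin(word).count("1") & 1


class ParityICache:
    """Toy instruction cache that detects errors by parity and refetches."""

    def __init__(self, backing: dict[int, int]) -> None:
        self.backing = backing                      # assumed-good higher level
        self.lines: dict[int, tuple[int, int]] = {}  # addr -> (word, parity)
        self.refetches = 0

    def fill(self, addr: int) -> int:
        word = self.backing[addr]
        self.lines[addr] = (word, parity(word))
        return word

    def fetch(self, addr: int) -> int:
        if addr not in self.lines:
            return self.fill(addr)
        word, stored_parity = self.lines[addr]
        if parity(word) != stored_parity:
            # Detected error: instructions are clean data, so discard the
            # line and refetch instead of correcting it in place.
            self.refetches += 1
            return self.fill(addr)
        return word


cache = ParityICache({0x80: 0x13})
cache.fetch(0x80)
# Flip one bit of the cached word to model a soft error.
word, p = cache.lines[0x80]
cache.lines[0x80] = (word ^ 0x4, p)
print(hex(cache.fetch(0x80)), cache.refetches)
```

Because the cached instructions are never modified locally, detection alone is enough to recover, which is why correction in this structure is not needed to meet the reliability goal in this example.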
|
Greg Favor
On Mon, Dec 13, 2021 at 5:38 PM Ved Shanbhogue <ved@...> wrote:
> This was one of the source of my questions. If the platform specifications intent is to specify the SEE, ISA and non-ISA hardware - the hardware/software contract - as visible to software so that a shrink wrapped operating system can load then I would say its not the platform specifications role to teach how to design resilient hardware. If the goal of the platform specification is to teach hardware designers about how to design resilient hardware then I think the specification falls short in many ways...I think you also hit upon that in the next statement.

I wouldn't view platform mandates of this sort as teaching, but as establishing a baseline that system integrators can depend on, by guiding the hardware developers as to what that expected baseline is. (But I get your point.)

> My understanding was the former i.e. establishing the standard for hardware-software interoperability. Specifically in areas of RAS I think where the interoperability is required - e.g. standardized logging/reporting, redirecting reporting to firmware-first, etc. I think should be in the purview.

Agreed. The fundamental question is whether the goal of the platform spec is solely to ensure hardware-software interoperability and not to go further in ensuring other minimum capabilities that compliant platforms will provide. What should be said and not said about RAS follows from that.

Given that people are leaning towards the more limited scope or goal for the OS-A platforms, that directly implies that there should be no requirements about what RAS features/coverage/etc. are actually implemented by compliant platforms.

Greg
|
On Mon, Dec 13, 2021 at 6:56 PM Greg Favor <gfavor@...> wrote:
The intent of the platform spec is hardware-software interoperability. I agree that dictating RAS hardware features is not within the scope of the platform spec. However, we do want standards for RAS error handling, error detection, logging/reporting and such. For example, using APEI to convey error information to OSPM is needed for software interop.

So one suggestion is that we remove mentions of specific errors (single-bit errors, multi-bit errors and such) and limit the features to error handling, detection and logging/reporting.

-- Regards Kumar
|
Ved Shanbhogue
On Mon, Dec 13, 2021 at 08:47:51PM -0800, Kumar Sankaran wrote:
So we could drop these statements:
"
- Main memory must be protected with SECDED-ECC.
- All cache structures must be protected.
- single-bit errors must be detected and corrected.
- multi-bit errors can be detected and reported.
"
And change this statement to drop the restriction to "these protected structures":
"There must be memory-mapped RAS registers to log detected errors with information about the type and location of the error"

regards
ved
|
Philipp Tomsich
Kumar & Greg,

On Tue, Dec 14, 2021 at 5:48 AM Kumar Sankaran <ksankaran@...> wrote:
On Mon, Dec 13, 2021 at 6:56 PM Greg Favor <gfavor@...> wrote:

If the content is worthwhile, please consider putting it in an informative section. Content such as that discussed might either become an (inline) application note or go into a separate informative appendix that dives into the relationship between OS-A and RAS features.

Philipp.
|