On Sun, Dec 12, 2021 at 4:15 PM Vedvyas Shanbhogue <ved@...> wrote:
Greetings All!
Please help with the following questions about the OS-A/server extension:
Section 2.1.4.1 - Timer support:
Should the ACLINT MTIMER support be optional or moved into the M-platform section or made optional for the server extension?
Section 2.1.4.2.4:
"Software interrupts for M-mode, HS-mode and VS-mode are supported using AIA IMSIC devices". Is the term "software interrupt" here intended to be the minor identities 3, 1, and 2? If so please clarify what the support in IMSIC is expected to be for these minor identities.
Section 2.1.7.1:
Is supporting SBI TIME optional if the server extension is supported, as the server extension requires Sstc? Is supporting SBI IPI optional if AIA IMSIC is supported?
Section 2.3.7.3.2 - PCIe memory space:
The requirement to not have any address translation for inbound accesses to any component in system address space is restrictive. If direct assignment of devices is supported then the IOMMU would be required to do the address translation for inbound accesses. Further, for hart-originated accesses where the PCIe memory is mapped into virtual address space, there needs to be a translation through the first and/or second level page tables. Please help clarify why PCIe memory must not be mapped into virtual address space and why use of an IOMMU to do translation is disallowed by the specification.
Section 2.3.7.3.3 - PCIe interrupts:
It seems unnecessary to require platforms built for the '22 version of the platform to support running software that is not MSI aware. Please clarify why supporting INTx emulation for legacy/pre-PCIe software compatibility is a required and not an optional capability for RISC-V platforms?
Section 2.3.9:
For many platforms, especially server class platforms, SECDED-ECC is not sufficient protection for main memory. Why is the platform spec restricting a platform to only support SECDED-ECC? Is it a violation of the platform specification if the platform supports stronger or more advanced ECC than SECDED-ECC?
Agree. The intent here was to mandate a minimal set of memory
protection features for server class platforms. It is not a violation
of the platform spec to have something better. As per the discussions
within the group, the RAS specification is something that needs to be
taken up by a TG within RISC-V and driven to completion. At this point
in time, we don't have a RAS spec, while at the same time, we didn't
want to leave this topic completely out of the platform spec, especially
for servers.
Will wording this as "Main memory must be protected with SECDED-ECC at
the minimum or a stronger/advanced method of protection" suffice?
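
To make the "SECDED at the minimum" baseline concrete, here is a minimal sketch of single-error-correct, double-error-detect behaviour using an extended Hamming(8,4) code. This is only an illustration: the function names, the 4-bit data width and the bit layout are assumptions made for the example and are not taken from the platform spec, and real memory controllers use much wider codes.

```python
# Illustrative SECDED sketch: extended Hamming(8,4).
# Bit 0 holds overall parity; bits 1..7 are the classic Hamming(7,4) positions.
# Names and layout are hypothetical, chosen only for this example.

def secded_encode(nibble: int) -> int:
    """Encode a 4-bit value into an 8-bit SECDED codeword."""
    bits = [0] * 8
    # Data bits live at the non-power-of-two positions 3, 5, 6, 7.
    bits[3] = (nibble >> 0) & 1
    bits[5] = (nibble >> 1) & 1
    bits[6] = (nibble >> 2) & 1
    bits[7] = (nibble >> 3) & 1
    # Hamming parity bits at positions 1, 2, 4.
    bits[1] = bits[3] ^ bits[5] ^ bits[7]
    bits[2] = bits[3] ^ bits[6] ^ bits[7]
    bits[4] = bits[5] ^ bits[6] ^ bits[7]
    # Overall parity over positions 1..7 makes the whole word even parity.
    bits[0] = sum(bits[1:]) & 1
    return sum(b << i for i, b in enumerate(bits))


def secded_decode(word: int):
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = [(word >> i) & 1 for i in range(8)]
    # Syndrome: XOR of the positions of all set bits (0 for a valid codeword).
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos
    odd_parity = sum(bits) & 1
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        # Exactly one bit flipped; syndrome 0 means the overall parity bit itself.
        bits[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: double-bit error.
        return None, "uncorrectable"
    data = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
    return data, status
```

A stronger code (for example one that corrects two bits and detects three) would satisfy the same reporting contract while exceeding this minimum, which is the case the proposed rewording is meant to allow.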
Which caches are/should be protected, and how, is usually a function of the technology used by the cache, the operating conditions, and the SDC/DUE goals of the platform, and is established based on FIT modeling for that platform. Please clarify the rationale for requiring single-bit error correction on all caches? Also please clarify why the spec does not allow for correcting multi-bit errors; based on the SDC goals some caches may need to support, for example, triple error detection and double error correction.
The rationale was to have a minimal set of RAS requirements until we
have a proper RISC-V RAS spec that we can refer to. Hence, having a
single-bit error correction was a minimalistic requirement. It is not
a platform spec violation to have the ability to correct multi-bit
errors.
The current wording is the following.
All cache structures must be protected.
single-bit errors must be detected and corrected.
multi-bit errors can be detected and reported.
Platforms are free to implement more advanced features than the
minimalistic requirements that are mandated here. So we should be OK.
Agree?
The firmware-first model seems to require a configuration per RAS-event/RAS-error-source to trigger firmware-first or OS-first. There may be hundreds of RAS events/errors that a platform supports. Why is it required to support per-event/per-error selectivity vs. a two-level selection where all RAS errors are either handled by firmware-first or handled by OS-first?
Yes, there may be hundreds of RAS errors/events. The intent here was
that at the lowest level of granularity, we should be able to
selectively route each of these to the respective software/firmware
entity. So yes, we could add additional gates on top like the two
level selection you have suggested but the platform spec is simply
conveying the expected support at the lowest level.
The current wording is "The platform should provide the capability to
configure each RAS error to trigger firmware-first or OS-first error
interrupt".
Will this suffice or do we need to add more clarity?
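
To illustrate how the two views can coexist, here is a rough sketch of per-source routing with an optional platform-wide gate on top. Everything in it (the `RasRouter` class, the `Handler` names, the source identifiers, the default) is hypothetical and only models the idea; it is not an interface defined by the platform spec.

```python
# Illustrative model only: per-error-source routing plus an optional global gate.
# All names here are hypothetical and not defined by any RISC-V specification.
from enum import Enum
from typing import Dict, Optional


class Handler(Enum):
    FIRMWARE_FIRST = "firmware-first"
    OS_FIRST = "os-first"


class RasRouter:
    def __init__(self, default: Handler = Handler.FIRMWARE_FIRST) -> None:
        self.default = default
        self.per_source: Dict[str, Handler] = {}
        self.global_gate: Optional[Handler] = None

    def set_source(self, source_id: str, handler: Handler) -> None:
        """Lowest level of granularity: route one error source."""
        self.per_source[source_id] = handler

    def set_global(self, handler: Optional[Handler]) -> None:
        """Optional coarse gate: force all sources one way, or None to disable."""
        self.global_gate = handler

    def route(self, source_id: str) -> Handler:
        # The coarse gate, when set, wins over any per-source setting.
        if self.global_gate is not None:
            return self.global_gate
        return self.per_source.get(source_id, self.default)


router = RasRouter()
router.set_source("dram-ce", Handler.OS_FIRST)
print(router.route("dram-ce").value)   # per-source setting applies
router.set_global(Handler.FIRMWARE_FIRST)
print(router.route("dram-ce").value)   # global gate overrides it
```

In this model a platform that only wants the two-level selection would use `set_global` alone, while the per-source table remains available for platforms that need finer control.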
regards
ved
--
Regards
Kumar