Greetings all,
I would like to start a discussion on supporting QoS capabilities in the RISC-V architecture. I hope I am posting on the right list/TG/HC.
First, a short background: Quality of Service (QoS) is the minimal end-to-end performance that is guaranteed in advance by a service level agreement (SLA) to an application. The performance may be measured in the form of metrics such as instructions per cycle (IPC), latency of servicing work, etc.
Various factors such as the available cache capacity, memory bandwidth, interconnect bandwidth, CPU cycles, system memory, etc. affect the performance of a computing system that runs multiple applications concurrently. Further, when arbitration is required for shared resources, the prioritization of an application's requests against other competing requests may also affect its performance.
When multiple applications are running concurrently on modern processors with large core counts, multiple cache hierarchies, and multiple memory controllers, the performance of an application becomes less deterministic, or even non-deterministic, because it depends on the behavior of all the other applications in the machine that contend for the shared resources, leading to interference. In many deployment scenarios, such as public cloud servers, the application owner may not be in control of the type and placement of other applications on the platform.
A typical use model involves profiling the resource usage of the application to meet desired performance goals and establishing resource allocations/limits for the application to achieve those goals.
System software can control some of the resources available to the application, such as the number of hardware threads made available for execution, the amount of system memory allocated to the application, and the number of CPU cycles provided for execution. It presently lacks, however, the capability to control interference to an application, and thereby reduce the variability in performance experienced by an application, due to other applications' use of cache capacity, memory bandwidth, interconnect bandwidth, etc.
Some thoughts on supporting such capability:
1. To provide differentiated services in the platform, a CSR may be provided to associate an identifier with an application (e.g., process, VM, container, etc.). This identifier is then associated with requests to access shared resources such as caches, interconnects, memory, etc.
2. Configuration registers and counters are needed in resource controllers (e.g., memory, cache, interconnect) to set up resource allocations and monitor resource usage. The controllers may use the identifiers associated with requests to enforce the configured resource allocations and/or monitor resource consumption.
Please share your comments and feedback. If there is WIP already, please point me to it.
regards ved
mark
Vedvyas,
Thank you.
We have a RAS committee in the org, approved by the BOD, but it has not yet been formed; QoS is one part of what it was intended to look at (as part of availability).
I wonder if we can't use this as an opportunity to initiate this committee. Once it has strategy, gaps, and priorities (through itself or a SIG), the idea is for the committee to work with Priv to create a TG.
We would need an acting committee chair to drive this. Policy here <https://docs.google.com/document/d/14ZpciYwIzmuiB92_hKfwTAttTnc3rsLbWI-CpC7MdC8/edit?usp=sharing>.
Mark
On Wed, Nov 10, 2021 at 6:11 AM Vedvyas Shanbhogue <ved@...> wrote:
Allen Baum
There is already a process identifier defined in the architecture (ASID), though it is local and not global across a system. I vaguely remember that the IOMMU and/or IOPMP proposals make use of something similar. Leveraging off those proposals would seem to be desirable if they fit.
A good paper on this topic is "Per-Thread Cycle Accounting in Multicore Processors"
The scontext and hcontext registers from the Debug spec also serve a similar function as process/context identifier. Maybe there is some synergy there.
On Wed, Nov 10, 2021 at 07:06:41AM -0800, Allen Baum wrote: There is already a process identifier defined in the architecture (ASID), though it is local and not global across a system. I vaguely remember that the IOMMU and/or IOPMP proposals make use of something similar. Leveraging off those proposals would seem to be desirable if they fit.
As you rightly pointed out, monitoring or allocation of resources requires a way to identify the originator of a request. Traditionally, as the request proceeds downstream through the network of resources, there is no way to associate it with a specific application or group of applications. In some usages, in addition to providing differentiated service among applications, the ability to differentiate between resource usage for code execution and for data accesses of the same application may be required.

Presently the ASID is defined to be private to a hart. This was clarified in version 1.11 of the privileged specification, though commentary was added about the possibility of a future global ASID. However, for QoS purposes the ASID may not lend itself well as an identifier. The system may want to group multiple applications/virtual-machines/containers into a resource control group. Further, the ASID does not help differentiate between code execution and data access. One way that could have been addressed is to carry a code/data indicator along with the request, but that may create some inefficiencies since the resource controllers would then need two sets of controls/counters per ID (one for code and the other for data); when differentiated service for code vs. data is not required, the per-ID code counters/controls may go unused. To support grouping, a lookup table may be employed in hardware to group multiple ASIDs together, but a lookup table accessed on each request increases hardware complexity, especially for high-speed implementations. So we may want to keep the hardware simpler and let the grouping be done by software.

So to support QoS we may want to provide a mechanism by which an application can be associated with a resource control ID (RCID) and a monitoring counter ID (MCID) that accompany each request made by the application.
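To make the "grouping done by software" idea concrete, here is a minimal sketch (Python, purely illustrative; the QOS_CSR stand-in and the dictionary-based grouping are assumptions of this sketch, not part of any proposal or ratified specification). Software maintains the application-to-group mapping and simply programs the group's IDs on each context switch, so hardware never needs a per-request lookup table:

```python
# Hypothetical sketch: software groups applications into resource-control
# groups and programs the group's IDs on every context switch, instead of
# hardware mapping ASIDs through a lookup table on each request.

QOS_CSR = {"rcid": 0, "mcid": 0}   # stand-in for a per-hart QoS CSR

class QosGroups:
    def __init__(self):
        # application (process/VM/container) -> (rcid, mcid)
        self.group_of = {}

    def assign(self, app, rcid, mcid):
        self.group_of[app] = (rcid, mcid)

    def context_switch(self, app):
        # software does the grouping; hardware just tags requests
        rcid, mcid = self.group_of.get(app, (0, 0))  # 0 = default group
        QOS_CSR["rcid"], QOS_CSR["mcid"] = rcid, mcid

g = QosGroups()
g.assign("vm-17", rcid=3, mcid=3)     # a whole VM shares one group
g.assign("proc-42", rcid=3, mcid=9)   # same RCID, unique MCID for monitoring
g.context_switch("proc-42")
```

Note how "proc-42" shares RCID 3 with "vm-17" (so they draw from one resource allocation) while carrying its own MCID, which is exactly the subset-monitoring use case described below.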
We would also want a mechanism to associate these IDs with requests made by a device on behalf of the application. Here the term application is used generically to refer to a process, a VM, a container, or other abstractions employed by the system for resource control.

An application would be associated with one RCID and one MCID that accompany its requests for data accesses, and a potentially different RCID and MCID that accompany its requests for code accesses. Data accesses include requests generated by load and store instructions as well as the implicit loads and stores to the first-stage and second-stage page tables. Where differentiated QoS for code vs. data is not required, the code and data RCID and MCID may be programmed to be the same.

A group of applications may be associated with the same RCID, and one or more of these applications may be associated with a unique MCID for code and/or data. This allows measuring the resource consumption of a subset of applications that share an RCID, to determine if the resource partitioning is optimal and to make adjustments as needed.

The RCID and MCID would want to have global scope across all caches, interconnects, and memory controllers that a request may access. To support maximum flexibility, the RCID and MCID may be defined to be up to 16 bits wide, but could be limited to more reasonable numbers by an implementation, e.g., 64 or 128 resource control IDs.

These IDs may thus be programmed into a set of CSRs (one each for M/S/VS mode), where each CSR is 64 bits wide, holding the RCID and MCID for code and data accesses respectively. For device-initiated accesses, these IDs could be programmed into the IOMMU such that the IOMMU associates them with the requests made by the device. Other implementations may support directly configuring these IDs into the devices themselves.
regards ved
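A possible packing of such a 64-bit QoS CSR can be sketched as follows (the field order and widths here are assumptions made for illustration; the proposal above only says the CSR is 64 bits wide and holds the RCID and MCID for code and data accesses, with IDs up to 16 bits):

```python
# Hypothetical packing of the proposed 64-bit QoS CSR: four 16-bit fields
# holding the RCID and MCID for data accesses and for code accesses.
# The layout (data RCID in bits 15:0, data MCID in 31:16, code RCID in
# 47:32, code MCID in 63:48) is an illustrative assumption only.

def pack_qos_csr(rcid_d, mcid_d, rcid_c, mcid_c):
    for field in (rcid_d, mcid_d, rcid_c, mcid_c):
        assert 0 <= field < (1 << 16)   # IDs are up to 16 bits wide
    return (mcid_c << 48) | (rcid_c << 32) | (mcid_d << 16) | rcid_d

def unpack_qos_csr(csr):
    mask = (1 << 16) - 1
    return (csr & mask,           # data RCID
            (csr >> 16) & mask,   # data MCID
            (csr >> 32) & mask,   # code RCID
            (csr >> 48) & mask)   # code MCID

# when code vs. data differentiation is not required, program both the same
csr = pack_qos_csr(rcid_d=5, mcid_d=12, rcid_c=5, mcid_c=12)
```

An implementation limited to, say, 64 RCIDs would simply treat the upper bits of each field as reserved.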
On Wed, Nov 10, 2021 at 06:40:19AM -0800, Mark Himelstein wrote: We have a RAS committee on the org and approved by the BOD but has not been formed and QOS is one part of what it was intended to look at (as part of availability).
I personally equated availability with tolerance of errors, i.e., mechanisms that help minimize the impact of an error and avoid unplanned downtime as much as possible. These mechanisms may provide failure prediction (e.g., corrected errors exceeding a threshold). They may also provide error isolation techniques (e.g., precise containment of corrupt data) to limit the blast radius of an error to the smallest possible set of components without causing extensive disruption to the rest of the system. Though I personally did not think of QoS as an availability mechanism, I am certainly open to a wider or different definition than the one I equated to above.

I wonder if we can't use this as an opportunity to initiate this committee. Once it has strategy, gaps, and priorities (through itself or a SIG), the idea is for the committee to work with Priv to create a TG.
I think that would be a good idea, and I look forward to contributing to this effort. I also look forward to your guidance and suggestions on how to proceed (sorry, I am a n00b on the process front). regards ved
Continuing this thread with some more thoughts included.
regards ved
On Fri, Nov 12, 2021 at 08:19:01AM -0600, Vedvyas Shanbhogue via lists.riscv.org wrote:
Quality of service enforcement in caches:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Caches that support the QoS extension allow the cache capacity to be allocated to applications and provide mechanisms to monitor the cache usage by the applications. The granularity of allocation is 1/MaxCacheBlocks, where MaxCacheBlocks is a property of the cache controller. A cache that supports this extension defines the number of blocks it supports. A cache block mask may then be configured in the cache controller for each supported RCID, where each bit of the mask corresponds to a cache block. All cache lookups scan the entire cache to determine if the requested line is present. If the requested cache line is not found, then a cache line may be allocated from the set of cache blocks selected by the RCID. If allocating a line requires evicting a previously allocated cache line, then the eviction candidate is obtained from the set of cache blocks selected by the RCID.

The cache controller implements a monitoring counter per MCID, and each counter can be programmed with a monitoring event ID that selects an event to count for requests with a matching MCID. One such event would be the number of cache lines allocated and resident in the cache by requests with the matching MCID. Some events counted by the cache controller may not be precise but are expected to be statistically accurate over a reasonable monitoring period. When a monitoring counter is enabled, the count held in the counter may not be accurate until an implementation-defined number of requests have been observed by the cache controller. The controller provides a validity indication to signal when the count is valid.

Quality of service enforcement in interconnects and memory controllers:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The interconnect and memory controller capacity, i.e.
bandwidth allocation enables restricting the bandwidth consumed by an application to a programmed limit. The bandwidth allocation is represented as a ratio of the maximum available bandwidth. The granularity of allocation is 1/MaxBWBlocks, where MaxBWBlocks is a power of 10 with a smallest value of 100 (e.g., 100, 1000, or 10000). MaxBWBlocks is a property of the interconnect or memory bandwidth controller.

Allocating bandwidth to an RCID involves configuring:
- A guaranteed bandwidth - Gbw
- A maximum bandwidth - Mbw
- A priority - Mprio - high, medium, or low

The Gbw is the minimum bandwidth, in units of bandwidth blocks, that is reserved for the RCID, and it must be at least one. The sum of Gbw across all RCIDs must not exceed MaxGBWBlocks. MaxGBWBlocks is a property of the interconnect or memory controller; in some implementations it may be the same as MaxBWBlocks, while other implementations may limit it to a fraction (e.g., 90%) of MaxBWBlocks.

The Mbw is the maximum bandwidth, in units of bandwidth blocks, that the RCID may consume. If Mbw <= Gbw, then Mbw does not constrain the bandwidth usage. If Mbw > Gbw, the bandwidth beyond Gbw is not guaranteed, and the actual bandwidth available may depend on the priority - Mprio - of the RCIDs that contend for the non-guaranteed bandwidth.

To enforce these limits, the controller needs to meter the bandwidth. Bandwidth metering involves counting the bytes transferred (B), in both directions, over a time interval (T) to determine the bandwidth B / T. The physical manifestations of such meters are outside the scope of this specification. An implementation may use discrete time intervals to count bytes such that no history is preserved from one time interval to the next; in such implementations, the counter B is reset at the start of each time interval. Other implementations may use a sliding time interval wherein the start of the time interval advances at a uniform rate. In such a sliding-time-interval scheme, the counter B increments on each request and decreases by the number of bytes of older requests that are no longer in the time interval. Such a scheme may require carrying a history of the requests received in any interval T.

If there is contention for bandwidth, then requests from RCIDs that have not consumed their Gbw have priority, irrespective of the Mprio configured for the RCID. Requesters that have consumed their Gbw contend with other requesters for the best-effort available bandwidth until they have consumed Mbw. The contention for the non-guaranteed bandwidth is resolved using Mprio. The proportion of excess bandwidth that may be allocated to each Mprio class is configurable in the form of a weight associated with each priority level.

The bandwidth controllers implement a monitoring counter for each MCID. The bandwidth monitoring counter reports the bytes that go past the monitoring point in the bandwidth controller. The bandwidth controller provides a mechanism to obtain a snapshot of the counter value and a timestamp at which the snapshot was taken. The timestamp shall be based on a timer that increments at the same rate as the clock used to provide the timestamp on reading the time CSR. By computing the difference between the byte counter values of two snapshots separated in time, and the difference between the timestamps of the two snapshots, the bandwidth consumed by the MCID in that interval can be determined.

Each counter can be programmed with a monitoring event ID such as "local read bandwidth", "local write bandwidth", "local read and write bandwidth", "remote read bandwidth", "remote write bandwidth", "remote read and write bandwidth", "total read bandwidth", "total write bandwidth", or "total read and write bandwidth" to select the event to count. When the event ID selects read bandwidth, the counter increments by the number of bytes transferred in response to a read request. When the event ID selects write bandwidth, the counter increments by the number of bytes transferred by a write request. The distinction of local vs. remote exists for non-uniform memory architectures, where local bandwidth is the bandwidth consumed by the MCID when it accesses resources in its NUMA domain, and remote bandwidth is the bandwidth consumed accessing resources outside its NUMA domain. The distinction of local vs. remote may not exist in some bandwidth controllers, and such controllers may only support monitoring of total read and/or write bandwidth.

Configuration interface
~~~~~~~~~~~~~~~~~~~~~~~
The configuration interface may be through a set of memory-mapped registers in each cache, interconnect, and memory controller.

A cache controller would provide registers for:
- Configuring the cache block allocations for an RCID
- Configuring a monitoring event for an MCID
- Reading the monitoring counters

A bandwidth controller would provide registers for:
- Configuring the guaranteed bandwidth (Gbw), maximum bandwidth (Mbw), and priority (Mprio) for an RCID
- Configuring a monitoring event for an MCID
- Reading the monitoring counters
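To illustrate the cache-capacity mechanism described above, here is a small Python model. It is purely a sketch: the replacement policy, the block/line organization, and the occupancy event are simplifying assumptions made for illustration, not behavior taken from any specification.

```python
# Toy model of RCID-based cache capacity allocation: lookups scan the
# whole cache, but allocation and eviction are confined to the cache
# blocks enabled in the RCID's block mask. Occupancy is tracked per MCID.

class QosCache:
    def __init__(self, num_blocks=8, lines_per_block=4):
        self.num_blocks = num_blocks          # "MaxCacheBlocks"
        self.block_mask = {}                  # per-RCID capacity mask
        self.occupancy = {}                   # per-MCID lines resident
        # each block holds a fixed number of (addr, mcid) entries
        self.blocks = [[None] * lines_per_block for _ in range(num_blocks)]

    def set_block_mask(self, rcid, mask):
        assert 0 < mask < (1 << self.num_blocks)
        self.block_mask[rcid] = mask

    def access(self, addr, rcid, mcid):
        # lookup scans the entire cache, regardless of the masks
        for block in self.blocks:
            for entry in block:
                if entry and entry[0] == addr:
                    return "hit"
        # miss: allocate (and evict) only within the RCID's blocks
        allowed = [i for i in range(self.num_blocks)
                   if self.block_mask[rcid] & (1 << i)]
        for i in allowed:  # prefer an empty line
            for j, entry in enumerate(self.blocks[i]):
                if entry is None:
                    self.blocks[i][j] = (addr, mcid)
                    self.occupancy[mcid] = self.occupancy.get(mcid, 0) + 1
                    return "miss"
        # otherwise evict from the first allowed block (toy policy)
        victim = self.blocks[allowed[0]][0]
        self.occupancy[victim[1]] -= 1
        self.blocks[allowed[0]][0] = (addr, mcid)
        self.occupancy[mcid] = self.occupancy.get(mcid, 0) + 1
        return "miss"
```

With masks such as 0b00001111 and 0b11110000, two RCIDs each get half the cache, and neither can evict the other's lines, which is the interference-isolation property the text is after.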
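The sliding-time-interval metering and the snapshot-based bandwidth computation described above can be sketched as follows (again illustrative only; the deque-based history is one possible realization, and real hardware would use approximations rather than a literal request log):

```python
from collections import deque

# Sketch of a sliding-window bandwidth meter: the counter B increments on
# each request and decreases by the bytes of requests that have slid out
# of the interval T. A monotonic per-MCID counter supports the snapshot
# scheme: bandwidth = delta(bytes) / delta(timestamp) between snapshots.

class SlidingWindowMeter:
    def __init__(self, window):
        self.window = window          # the time interval T
        self.history = deque()        # (timestamp, bytes) of recent requests
        self.bytes_in_window = 0      # the counter B
        self.total_bytes = 0          # monotonic monitoring counter

    def record(self, now, nbytes):
        self.history.append((now, nbytes))
        self.bytes_in_window += nbytes
        self.total_bytes += nbytes
        # drop requests that are no longer inside the window
        while self.history and self.history[0][0] <= now - self.window:
            self.bytes_in_window -= self.history.popleft()[1]

    def bandwidth(self):
        return self.bytes_in_window / self.window   # B / T

    def snapshot(self, now):
        # counter value plus the timestamp at which it was taken
        return (self.total_bytes, now)

def bandwidth_between(snap1, snap2):
    (b1, t1), (b2, t2) = snap1, snap2
    return (b2 - b1) / (t2 - t1)
```

The Gbw/Mbw enforcement decision would then compare `bandwidth()` against the RCID's configured limits on each arbitration, which this sketch deliberately leaves out.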
|
|
toggle quoted message
Show quoted text
On Fri, Dec 3, 2021 at 8:39 AM Ved Shanbhogue <ved@...> wrote:
Continuing this thread with some more thoughts included.
regards ved
On Fri, Nov 12, 2021 at 08:19:01AM -0600, Vedvyas Shanbhogue via lists.riscv.org wrote:
Presently the ASID is defined to be private to a hart. This was clarified in version 1.11 of the privileged specification, with commentary added about the possibility of a future global ASID. However, for QoS purposes the ASID may not lend itself well as an identifier. The system may want to group multiple applications/virtual-machines/containers into a resource control group. Further, the ASID does not help differentiate between code execution and data access. One way that could be addressed is to carry a code/data indicator along with the request, but that may create some inefficiencies since the resource controllers would then need two sets of controls/counters per ID (one for code and the other for data); when differentiated service for code vs. data is not required, the per-ID code counters/controls may go unused. To support grouping, a lookup table could be employed in hardware to group multiple ASIDs together, but accessing a lookup table on each request increases hardware complexity, especially for high-speed implementations. So we may want to keep the hardware simpler and let the grouping be done by software.
So to support QoS we may want to provide a mechanism by which an application can be associated with a resource control ID (RCID) and a monitoring counter ID (MCID) that accompany each request made by the application. We would also want a mechanism to associate these IDs with requests made by a device on behalf of the application. Here the term application is used generically to refer to a process, a VM, a container, or other abstractions employed by the system for resource control.
An application would be associated with one RCID and one MCID that accompany its requests for data accesses, and a potentially different RCID and MCID that accompany its requests for code accesses. Data accesses include requests generated by load and store instructions as well as the implicit loads and stores to the first-stage and second-stage page tables. Where differentiated QoS for code vs. data is not required, the code and data RCID and MCID may be programmed to be the same.
A group of applications may be associated with the same RCID, and one or more of these applications may be associated with a unique MCID for code and/or data. This allows measuring the resource consumption of a subset of applications that share an RCID to determine if the resource partitioning is optimal and to make adjustments as needed.
The RCID and MCID would need to have global scope across all caches, interconnects, and memory controllers that a request may access. To support maximum flexibility, the RCID and MCID may be defined to be up to 16 bits wide, but could be limited to more reasonable numbers by an implementation, e.g., 64 or 128 resource control IDs.
These IDs may thus be programmed into a set of CSRs (one each for M/S/VS mode), where each CSR is 64 bits wide and holds the RCID and MCID for code and data accesses respectively. For device-initiated accesses, these IDs could be programmed into the IOMMU such that the IOMMU attaches them to the requests made by the device. Other implementations may support directly configuring these IDs into the devices themselves.
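To make the CSR layout concrete, here is a sketch of packing the four IDs into one 64-bit qoscfg value. The field positions and the helper names are assumptions for illustration only; the proposal above does not fix bit positions.

```python
# Hypothetical 64-bit qoscfg CSR layout: four 16-bit fields holding the
# data RCID, data MCID, code RCID, and code MCID. Bit positions are
# illustrative, not part of the proposal.

def pack_qoscfg(data_rcid, data_mcid, code_rcid, code_mcid):
    for v in (data_rcid, data_mcid, code_rcid, code_mcid):
        assert 0 <= v < (1 << 16), "RCID/MCID are at most 16 bits wide"
    return (data_rcid
            | (data_mcid << 16)
            | (code_rcid << 32)
            | (code_mcid << 48))

def unpack_qoscfg(value):
    mask = (1 << 16) - 1
    return (value & mask, (value >> 16) & mask,
            (value >> 32) & mask, (value >> 48) & mask)
```

Where differentiated service for code vs. data is not needed, software would simply program the same IDs into both field pairs.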
Quality of service enforcement in caches:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Caches that support the QoS extension allow the cache capacity to be allocated to applications and provide mechanisms to monitor the cache usage by the applications. The granularity of allocation is 1/MaxCacheBlocks, where MaxCacheBlocks is a property of the cache controller. A cache that supports this extension defines the number of blocks supported by the cache. A cache block mask may then be configured in the cache controller for each supported RCID, where each bit of the mask corresponds to a cache block. All cache lookups scan the entire cache to determine if the requested line is present. If the requested cache line is not found, then a cache line may be allocated from the set of cache blocks selected by the RCID. If allocating a line requires an eviction of a previously allocated cache line, then the eviction candidate is obtained from the set of cache blocks selected by the RCID.
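The block-mask allocation and eviction policy above can be sketched as follows. The MaxCacheBlocks value and the LRU-based victim selection are example choices, not requirements of the proposal.

```python
# Sketch of cache-block-mask-based allocation: the cache capacity is
# divided into MaxCacheBlocks blocks, and a per-RCID bitmask selects
# which blocks that RCID may allocate into or evict from.

MAX_CACHE_BLOCKS = 8  # property of the cache controller (example value)

def allocatable_blocks(rcid_mask):
    """Return the block indices the RCID may allocate/evict from."""
    assert 0 < rcid_mask < (1 << MAX_CACHE_BLOCKS)
    return [b for b in range(MAX_CACHE_BLOCKS) if rcid_mask & (1 << b)]

def pick_eviction_block(rcid_mask, lru_order):
    """Pick a victim restricted to the RCID's blocks; lru_order lists
    block indices least-recently-used first (an example policy)."""
    allowed = set(allocatable_blocks(rcid_mask))
    for b in lru_order:
        if b in allowed:
            return b
    raise ValueError("mask selects no blocks")
```

Note that lookups are unrestricted (the whole cache is scanned for a hit); the mask constrains only allocation and victim selection.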
The cache controller implements a monitoring counter per RCID, and the counter can be programmed with a monitoring event ID that selects an event to count for requests with a matching RCID. One such event ID would be to count the number of cache lines allocated and resident in the cache by requests with the matching RCID. Some events counted by the cache controller may not be precise but are expected to be statistically accurate over a reasonable monitoring period. When a monitoring counter is enabled, the count held in the counter may not be accurate until an implementation-defined number of requests have been observed by the cache controller. The controller provides a validity indication to indicate when the count is valid.
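A minimal model of such an occupancy counter with a validity indication might look like this; the warm-up threshold is implementation-defined, and the value below is just an example.

```python
class CacheOccupancyCounter:
    """Sketch of a per-RCID occupancy monitor: tracks lines currently
    resident that were allocated by the RCID, and reports the count as
    invalid until enough requests have been observed."""

    WARMUP_REQUESTS = 1000  # implementation-defined (example value)

    def __init__(self):
        self.occupancy = 0
        self.requests_seen = 0

    def on_fill(self):
        """A line was allocated for this RCID."""
        self.occupancy += 1
        self.requests_seen += 1

    def on_evict(self):
        """A line belonging to this RCID was evicted."""
        self.occupancy -= 1

    def read(self):
        """Return (count, valid): valid only after warm-up."""
        return self.occupancy, self.requests_seen >= self.WARMUP_REQUESTS
```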
Quality of service enforcement in interconnects and memory controllers:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Interconnect and memory controller capacity, i.e., bandwidth, allocation enables restricting the bandwidth consumed by an application to a programmed limit. The bandwidth allocation is represented as a ratio of the maximum available bandwidth. The granularity of allocation is 1/MaxBWBlocks, where MaxBWBlocks is a power of 10 with a smallest value of 100 (e.g., 100, 1000, or 10000). MaxBWBlocks is a property of the interconnect or memory bandwidth controller. Allocating bandwidth to an RCID involves configuring:
- A guaranteed bandwidth - Gbw
- A maximum bandwidth - Mbw
- A priority - Mprio - high, medium, or low
The Gbw is the minimum bandwidth, in units of bandwidth blocks, that is reserved for the RCID and must be at least one. The sum of Gbw across all RCIDs must not exceed MaxGBWBlocks. MaxGBWBlocks is a property of the interconnect or memory controller; in some implementations it may be the same as MaxBWBlocks, while other implementations may limit it to a fraction (e.g., 90%) of MaxBWBlocks. The Mbw is the maximum bandwidth, in units of bandwidth blocks, that the RCID may consume. If Mbw is <= Gbw then Mbw does not constrain the bandwidth usage. If Mbw is > Gbw, the bandwidth beyond Gbw is not guaranteed, and the actual bandwidth available may depend on the priority - Mprio - of the RCIDs that contend for the non-guaranteed bandwidth. To enforce these limits, the controller needs to meter the bandwidth. Bandwidth metering involves counting the bytes transferred (B), in both directions, over a time interval (T) to determine the bandwidth B / T.
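The allocation rules above (Gbw >= 1, sum of Gbw bounded by MaxGBWBlocks, and Mbw <= Gbw being non-constraining) can be captured in a short validation sketch. The function and dictionary shapes are illustrative, not a proposed software interface.

```python
def validate_bw_alloc(allocs, max_gbw_blocks):
    """allocs maps RCID -> (Gbw, Mbw, Mprio), all in bandwidth blocks.
    Returns the effective per-RCID cap; raises if the guarantees are
    oversubscribed."""
    for rcid, (gbw, _mbw, _mprio) in allocs.items():
        if gbw < 1:
            raise ValueError(f"RCID {rcid}: Gbw must be at least one block")
    if sum(g for g, _, _ in allocs.values()) > max_gbw_blocks:
        raise ValueError("sum of Gbw exceeds MaxGBWBlocks")
    # If Mbw <= Gbw, Mbw does not constrain usage, so the effective cap
    # is max(Gbw, Mbw).
    return {rcid: max(gbw, mbw) for rcid, (gbw, mbw, _m) in allocs.items()}
```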
The physical manifestation of such meters is outside the scope of this specification. Implementations may use discrete time intervals to count bytes such that no history is preserved from one time interval to the next; in such implementations, the counter B is reset at the start of each time interval. Other implementations may use a sliding time interval, wherein the start of the time interval advances at a uniform rate. In such a sliding time interval scheme, the counter B increments on each request and decreases by the number of bytes of older requests that are no longer in the time interval. Such a scheme may require carrying a history of the requests received in any interval T.
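The sliding-interval scheme can be sketched with a small model that keeps the per-request history the text mentions; this is purely illustrative of the bookkeeping, not a hardware design.

```python
from collections import deque

class SlidingWindowMeter:
    """Sliding-interval bandwidth meter: B counts the bytes of requests
    seen within the last `window` time units, retiring older requests
    as the interval advances."""

    def __init__(self, window):
        self.window = window
        self.history = deque()      # (timestamp, bytes) per request
        self.bytes_in_window = 0    # the counter B

    def record(self, now, nbytes):
        self.history.append((now, nbytes))
        self.bytes_in_window += nbytes
        self._retire(now)

    def _retire(self, now):
        # Drop requests that have slid out of the interval.
        while self.history and self.history[0][0] <= now - self.window:
            _, old_bytes = self.history.popleft()
            self.bytes_in_window -= old_bytes

    def bandwidth(self, now):
        """Bandwidth estimate B / T over the current interval."""
        self._retire(now)
        return self.bytes_in_window / self.window
```

A discrete-interval implementation would instead reset `bytes_in_window` at each interval boundary and keep no history.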
If there is contention for bandwidth, then requests from RCIDs that have not consumed their Gbw have priority, irrespective of the Mprio configured for the RCID. Requesters that have consumed their Gbw contend with other requesters for the best-effort available bandwidth until they have consumed Mbw. The contention for the non-guaranteed bandwidth is resolved using Mprio. The proportion of excess bandwidth that may be allocated to each Mprio class is configurable as a weight associated with each priority level.
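The weighted split of excess bandwidth across priority classes amounts to a proportional share; a sketch, with example weights:

```python
def share_excess(excess_blocks, weights):
    """Split the non-guaranteed bandwidth among contending Mprio
    classes in proportion to their configured weights. The weight
    values are whatever the controller is configured with."""
    total = sum(weights.values())
    return {prio: excess_blocks * w / total for prio, w in weights.items()}
```

For example, with weights 3/2/1 for high/medium/low, a high-priority class receives half of the excess bandwidth when all three classes contend.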
The bandwidth controllers implement a monitoring counter for each MCID. The bandwidth monitoring counter reports the bytes that go past the monitoring point in the bandwidth controller. The bandwidth controller provides a mechanism to obtain a snapshot of the counter value and a timestamp at which the snapshot was taken. The timestamp shall be based on a timer that increments at the same rate as the clock used to provide the timestamp on reading the time CSR. By computing the difference between the byte counter values of two snapshots separated in time, and the difference between the timestamps of the two snapshots, the bandwidth consumed by the MCID in that interval can be determined. Each counter can be programmed with a monitoring event ID such as “local read bandwidth”, “local write bandwidth”, “local read and write bandwidth”, “remote read bandwidth”, “remote write bandwidth”, “remote read and write bandwidth”, “total read bandwidth”, “total write bandwidth”, or “total read and write bandwidth” to select the event to count. When the event ID selects read bandwidth, the counter increments by the number of bytes transferred in response to a read request. When the event ID selects write bandwidth, the counter increments by the number of bytes transferred by a write request. The distinction of local vs. remote exists for non-uniform memory architectures, where local bandwidth is the bandwidth consumed by the MCID when it accesses resources in its NUMA domain and remote bandwidth is the bandwidth consumed accessing resources outside its NUMA domain. The distinction of local vs. remote may not exist in some bandwidth controllers, and such controllers may only support monitoring of total read and/or write bandwidth.
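The snapshot-based bandwidth computation reduces to a byte delta over a time delta:

```python
def bandwidth_from_snapshots(snap0, snap1):
    """Each snapshot is (byte_count, timestamp), both taken atomically
    from the bandwidth controller. The bandwidth consumed over the
    interval is the byte delta divided by the timestamp delta."""
    bytes0, t0 = snap0
    bytes1, t1 = snap1
    assert t1 > t0, "snapshots must be separated in time"
    return (bytes1 - bytes0) / (t1 - t0)
```

Because the timestamp ticks at the same rate as the time CSR, software can convert the result to bytes per second using the platform's time base frequency.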
Configuration interface
~~~~~~~~~~~~~~~~~~~~~~~
The configuration interface may be through a set of memory-mapped registers in each cache, interconnect, and memory controller.
A cache controller would provide registers for:
- Configuring the cache block allocations for an RCID
- Configuring a monitoring event for an MCID
- Reading the monitoring counters

A bandwidth controller would provide registers for:
- Configuring guaranteed bandwidth, maximum bandwidth, and priority for an RCID
- Configuring a monitoring event for an MCID
- Reading the monitoring counters
Jonathan Behrens <behrensj@...>
I only skimmed some of the proposal, but one thing I noticed is that there doesn't seem to be much of a limit on who can set the current RCID and MCID. In particular, with the H-extension it looks like a guest operating system can freely set its own IDs. That would for instance mean that a cloud provider that ran multiple customer VMs couldn't use this to monitor or limit resource usage of individual VMs.
Jonathan
Greetings All!
So finally collected these thoughts and put them into a document:
https://docs.google.com/document/d/1SfvV0oJHiRa89K5IZkzWL-FtXsAGhOPYrMF9B1-1vp8/edit?usp=sharing
The document also has some thoughts on a configuration interface for
cache and bandwidth controllers.
Look forward to feedback and comments.
regards
ved
On Thu, Dec 16, 2021 at 04:35:38PM -0500, Jonathan Behrens wrote:
> I only skimmed some of the proposal, but one thing I noticed is that there doesn't seem to be much limit over who can set the current RCID and MCID. In particular, with the H-extension it looks like a guest operating system can freely set its own IDs. That would for instance mean that a cloud provider that ran multiple customer VMs couldn't use this to monitor or limit resource usage of individual VMs.
> Jonathan

Hi - Thanks. Good that you noted that. I realized I did not include the H-level CSRs that go with the vsqoscfg CSR. The idea is to have a virtual RCID/MCID to physical RCID/MCID mapping, where a virtual RCID/MCID written to the vsqoscfg CSR indexes into the H-level CSRs that provide the physical RCID/MCID. This would optionally allow the guest to switch between a subset of the RCID/MCID made available to it by the VMM. If this virtualization extension is not implemented, or the VMM does not enable it, then accessing the sqoscfg CSR would cause a virtual instruction trap to the VMM. I will update the document in a bit.
regards
ved
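The virtual-to-physical ID indirection described in this reply can be sketched as follows. The table size, class name, and trap modeling are assumptions for illustration; the actual mechanism would be a set of H-level CSRs.

```python
# Sketch of the QoS ID virtualization: the VMM programs a small table
# of physical (RCID, MCID) pairs (modeling the H-level CSRs), and a
# guest write of a virtual ID to vsqoscfg selects an entry. An
# out-of-range virtual ID is modeled here as a trap to the VMM.

class QosVirtualizer:
    def __init__(self, table):
        # table: virtual ID -> (physical_rcid, physical_mcid),
        # programmed by the VMM.
        self.table = table

    def write_vsqoscfg(self, virtual_id):
        if virtual_id not in self.table:
            # Modeled as a virtual instruction trap to the VMM.
            raise PermissionError("virtual instruction trap to the VMM")
        return self.table[virtual_id]
```

This keeps the guest from naming arbitrary physical IDs, which addresses the monitoring/limiting concern for multi-tenant VMs raised above.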