Is it expected that there should be a watchdog timer and timeout signal per hart in the system, or is it okay for there to be one timer in the system and for the timeout signal to be delivered to a specific hart?
Thanks, James
|
|
On Mon, Feb 28, 2022 at 6:18 PM James Robinson <jrobinson@...> wrote:
> Is it expected that there should be a watchdog timer and timeout signal per hart in the system, or is it okay for there to be one timer in the system and for the timeout signal to be delivered to a specific hart?
For now (this year) RVI is focusing on standardizing an initial OS-A SEE (Supervisor Execution Environment) and an OS-A Platform standardizing Supervisor and User level functionality, i.e. not Machine-level functionality. While that doesn't rule out incorporating some form of Supervisor-level watchdog standardization into these specs, I think (?) the current thoughts are not focused on doing so.
FYI - Last year there was an initial proposal for standard hardware watchdog functionality, and then later a proposal instead for an SBI API (e.g. a call to tickle the supervisor watchdog, and a callback on a first-stage timeout).
But certainly speak up with your own arguments or justifications for having and standardizing supervisor watchdog functionality. (Note: ARM SBSA - for server and high-end embedded class systems - defined and required the equivalent of S-mode (aka Non-Secure) and M-mode (aka Secure) two-stage watchdog functionality.)
Aaron (acting chair of the OS-A SEE TG) and others in the OS-A SEE group, what do you think? Should some form of support for Supervisor software tickling a watchdog through some form of standardized hardware (e.g. memory-mapped registers) or software (e.g. SBI) interface be included in the OS-A SEE spec?
Greg
|
|
Hi Greg,
Thanks for your response. I'm not sure if I'm missing something about there being a connection between having a supervisor-level watchdog timer and having a timer per hart, but I wasn't particularly imagining a distinction between machine and supervisor mode watchdog timers. I'll re-pose the question I was thinking about:
Suppose I have a system containing 16 harts. Should I have a separate WDCSR memory mapped register and associated counter for each of the 16 harts, with each counter directing an interrupt to its associated hart if it is not reset before the timeout expires? Or should I have one WDCSR memory mapped register and associated counter for the whole system, with the interrupt directed to one specific hart, and that hart being responsible for responding to a lack of timer update?
Thanks, James
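To make the two layouts in the question concrete, here is a minimal Python sketch of a toy model. Everything in it is an assumption for illustration: the `Watchdog` class, the tick-based countdown, and the rule that the hart receiving the interrupt is also the hart that refreshes the counter are hypothetical, and none of it is taken from a RISC-V specification or from the WDCSR proposal.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Watchdog:
    """One countdown timer whose timeout interrupt targets a single hart."""
    timeout: int       # ticks allowed between refreshes (hypothetical unit)
    target_hart: int   # hart that receives the interrupt and owns the refresh
    remaining: int = field(init=False)

    def __post_init__(self) -> None:
        self.remaining = self.timeout

    def refresh(self) -> None:
        """Model a write to the (hypothetical) WDCSR that restarts the count."""
        self.remaining = self.timeout

    def tick(self) -> bool:
        """Advance one tick; return True once the count has reached zero."""
        if self.remaining > 0:
            self.remaining -= 1
        return self.remaining == 0


def per_hart(num_harts: int, timeout: int) -> List[Watchdog]:
    """Layout 1: one register and counter per hart."""
    return [Watchdog(timeout, target_hart=h) for h in range(num_harts)]


def system_wide(timeout: int, monitor_hart: int = 0) -> List[Watchdog]:
    """Layout 2: one register and counter, interrupt routed to one hart."""
    return [Watchdog(timeout, target_hart=monitor_hart)]


def run(dogs: List[Watchdog], hung_harts: Set[int], ticks: int) -> List[int]:
    """Return the harts that were sent a timeout interrupt within `ticks`."""
    fired: List[int] = []
    for _ in range(ticks):
        for dog in dogs:
            # A healthy owner hart refreshes its watchdog every tick.
            if dog.target_hart not in hung_harts:
                dog.refresh()
            if dog.tick() and dog.target_hart not in fired:
                fired.append(dog.target_hart)
    return fired
```

In this toy model a hang on hart 5 is noticed only in the per-hart layout. The single shared watchdog fires only when the hart responsible for refreshing it stops doing so.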
|
|
On Wed, Mar 2, 2022 at 12:35 AM James Robinson <jrobinson@...> wrote:
> Hi Greg,
> Thanks for your response. I'm not sure if I'm missing something about there being a connection between having a supervisor-level watchdog timer and having a timer per hart, but I wasn't particularly imagining a distinction between machine and supervisor mode watchdog timers. I'll re-pose the question I was thinking about:
> Suppose I have a system containing 16 harts. Should I have a separate WDCSR memory mapped register and associated counter for each of the 16 harts, with each counter directing an interrupt to its associated hart if it is not reset before the timeout expires? Or should I have one WDCSR memory mapped register and associated counter for the whole system, with the interrupt directed to one specific hart, and that hart being responsible for responding to a lack of timer update?
If one is operating the machine with 16 harts without any sharding or partitioning, I don't see why one would need a watchdog per hart. System watchdogs, or TCO timers in other architectures' parlance, are for system use. Now, a core would normally have its own watchdog for instruction-retirement forward-progress purposes, but that's a completely different use case from the intention of a system-level watchdog.
As for Greg's question about putting that in OS-A SEE or a Platform itself, I'm open to suggestions. However, my initial thinking is that it would be deferred to a Platform. The thinking is that OS-A SEE is about targeting SW expectations for the kernel. Kernels are really good about runtime binding of drivers based on the presence of hardware so I'm not overly inclined to mandate such things. That said, I'd be open to hear other opinions.
|
|
Hi Aaron,
Thanks for the response. Would you be able to give any more details on how a core level watchdog would differ from a platform level one?
James
|
|
A core-level watchdog can mean quite different things to different people and their core designs. In some cases this "watchdog" would be a micro-architectural thing that, for example, recognizes that the core is not making forward progress and would temporarily invoke some low-performance uarch mechanism that guarantees forward progress (out of the circumstances currently causing livelock). Although the details of that very much depend on what types of livelock causes one is concerned about. In other cases this "watchdog" might generate a local interrupt to take the core into a "lack of forward progress" software handler; or a global interrupt to inform someone else that this core is livelocked.
In general, there's an enormous range of possibilities as to what a core-level watchdog means. And an enormous range as to what one is trying to accomplish or defend against.
Greg
On Wed, Mar 2, 2022 at 12:09 PM James Robinson <jrobinson@...> wrote:
> Hi Aaron,
> Thanks for the response. Would you be able to give any more details on how a core level watchdog would differ from a platform level one?
> James
|
|
On Wed, Mar 2, 2022 at 1:19 PM Greg Favor <gfavor@...> wrote:
> A core-level watchdog can mean quite different things to different people and their core designs. In some cases this "watchdog" would be a micro-architectural thing that, for example, recognizes that the core is not making forward progress and would temporarily invoke some low-performance uarch mechanism that guarantees forward progress (out of the circumstances currently causing livelock). Although the details of that very much depend on what types of livelock causes one is concerned about. In other cases this "watchdog" might generate a local interrupt to take the core into a "lack of forward progress" software handler; or a global interrupt to inform someone else that this core is livelocked.
> In general, there's an enormous range of possibilities as to what a core-level watchdog means. And an enormous range as to what one is trying to accomplish or defend against.
Yes. Greg articulated what I was getting at better than I did. I apologize for muddying the waters. From a platform standpoint one system-level watchdog should suffice as it's typically the last resort of restarting a system prior to sending a tech out.
|
|
On Wed, Mar 2, 2022 at 12:23 PM Aaron Durbin <adurbin@...> wrote:
> Yes. Greg articulated what I was getting at better than I did. I apologize for muddying the waters. From a platform standpoint one system-level watchdog should suffice as it's typically the last resort of restarting a system prior to sending a tech out.
One comment - for when any concrete discussion about having a system-level watchdog occurs:
One can have a one-stage or a two-stage watchdog. The former yanks the emergency cord on the system upon timeout.
The latter (which is what ARM defined in SBSA and the subsequent BSA) interrupts the OS on the first timeout and gives it a chance to take remedial actions (and refresh the watchdog). Then, if a second timeout occurs (without a refresh after the first timeout), the emergency cord is yanked.
ARM also defined separate Non-Secure and Secure watchdogs (akin to what one might call S-mode and M-mode watchdogs, respectively). The OS has its own watchdog to tickle and an emergency situation results in reboot of the OS (for example). And the Secure Monitor has its own watchdog and an emergency situation results in reboot of the system (for example).
Greg
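The one-stage versus two-stage behaviour described above can be sketched as a small simulation. This is an illustrative model only: the class name `StagedWatchdog`, the tick-based period, and the event strings "interrupt" and "reset" are hypothetical stand-ins for the first-stage signal and the "emergency cord", not the SBSA/BSA register interface.

```python
class StagedWatchdog:
    """Toy model of a one-stage or two-stage watchdog (hypothetical API)."""

    def __init__(self, period: int, stages: int = 2) -> None:
        if stages not in (1, 2):
            raise ValueError("stages must be 1 or 2")
        self.period = period        # ticks per watch period
        self.stages = stages        # 1 = one-stage, 2 = two-stage
        self.remaining = period     # ticks left in the current period
        self.timeouts = 0           # timeouts since the last refresh
        self.events = []            # signals raised so far, in order

    def refresh(self) -> None:
        """Software 'tickles' the watchdog: restart from the first stage."""
        self.remaining = self.period
        self.timeouts = 0

    def tick(self) -> None:
        """Advance time by one tick and raise a signal on period expiry."""
        if self.timeouts >= self.stages:
            return                  # emergency action already taken; latched
        self.remaining -= 1
        if self.remaining > 0:
            return
        self.timeouts += 1
        self.remaining = self.period
        if self.timeouts < self.stages:
            # First timeout of a two-stage watchdog: warn the OS.
            self.events.append("interrupt")
        else:
            # Final timeout: the emergency cord is yanked.
            self.events.append("reset")


one_stage = StagedWatchdog(period=2, stages=1)
for _ in range(2):
    one_stage.tick()                # no refresh arrives in time

two_stage = StagedWatchdog(period=2, stages=2)
for _ in range(2):
    two_stage.tick()                # first timeout: interrupt only
two_stage.refresh()                 # OS recovers and tickles the watchdog
for _ in range(4):
    two_stage.tick()                # two timeouts in a row, no refresh
```

Separate S-mode and M-mode watchdogs would, in this model, simply be two independent instances, each with its own refresh path and its own meaning for "reset".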
|
|

Kumar Sankaran
From a platform standpoint, the intent was to have a single platform level watchdog that is shared across the entire platform. This platform watchdog could be the 2-level watchdog as described below by Greg. Whether S-mode software or M-mode software would handle the tickling of this watchdog and handle timeouts is a subject for further discussion.
On Wed, Mar 2, 2022 at 12:34 PM Greg Favor <gfavor@...> wrote:
> On Wed, Mar 2, 2022 at 12:23 PM Aaron Durbin <adurbin@...> wrote:
>> Yes. Greg articulated what I was getting at better than I did. I apologize for muddying the waters. From a platform standpoint one system-level watchdog should suffice as it's typically the last resort of restarting a system prior to sending a tech out.
>
> One comment - for when any concrete discussion about having a system-level watchdog occurs:
> One can have a one-stage or a two-stage watchdog. The former yanks the emergency cord on the system upon timeout.
> The latter (which is what ARM defined in SBSA and the subsequent BSA) interrupts the OS on the first timeout and gives it a chance to take remedial actions (and refresh the watchdog). Then, if a second timeout occurs (without a refresh after the first timeout), the emergency cord is yanked.
> ARM also defined separate Non-Secure and Secure watchdogs (akin to what one might call S-mode and M-mode watchdogs, respectively). The OS has its own watchdog to tickle and an emergency situation results in reboot of the OS (for example). And the Secure Monitor has its own watchdog and an emergency situation results in reboot of the system (for example).
> Greg
-- Regards Kumar
|
|

Allen Baum
Now we're starting to drill down appropriately. There is a wide range. This is me thinking out loud and trying desperately to avoid the real work I should be doing:
- A watchdog time event can cause an interrupt (as opposed to a HW reset)
  -- maskable or non-maskable?
  -- Using xTVEC to vector or a platform defined vector? (e.g. the reset vector)
  -- A new cause type or reuse an existing one? (e.g. using the reset cause)
  -- restartable or non-restartable or both? (both implies - to me at least - the 2 stage watchdog concept, "pulling the emergency cord")
     If the watchdog timer is restartable, either it must
     --- be maskable, or
     --- implement something like the restartable-NMI spec to be able to save state.
  -- what does "pulling the emergency cord" do? e.g.
     --- some kind of HW reset (we had a light reset at Intel that cleared as little as possible so that a post-mortem dump could identify what was going on)
     --- just vector to a SW handler (obviously this should depend on why the watchdog timer was activated, e.g. waiting for a HW event or SW event)
On Wed, Mar 2, 2022 at 12:41 PM Kumar Sankaran <ksankaran@...> wrote:
> From a platform standpoint, the intent was to have a single platform level watchdog that is shared across the entire platform. This platform watchdog could be the 2-level watchdog as described below by Greg. Whether S-mode software or M-mode software would handle the tickling of this watchdog and handle timeouts is a subject for further discussion.
|
|
Even ARM SBSA allowed a lot of flexibility as to where the first-stage and second-stage timeout "signals" went (which ultimately then placed the handling in the hands of software somewhere). In other words, SBSA didn't prescribe the details of the overall watchdog handling picture.
Greg
|
|

Allen Baum
Don't they even define whether restartability is required or not?
On Wed, Mar 2, 2022 at 4:00 PM Greg Favor <gfavor@...> wrote:
> Even ARM SBSA allowed a lot of flexibility as to where the first-stage and second-stage timeout "signals" went (which ultimately then placed the handling in the hands of software somewhere). In other words, SBSA didn't prescribe the details of the overall watchdog handling picture.
|
|
> Don't they even define whether restartability is required or not?
Since the suitable response to a first or second stage timeout is rather system-specific, ARM didn't try to ordain exactly where the timeout signals go and what happens as a result. In SBSA they just described the general expected possibilities (which my previous remarks were based on). But here's what a 2020 version of BSA says (which is roughly similar to SBSA but a bit narrower in the possibilities it describes):
The basic function of the Generic Watchdog is to count for a fixed period of time, during which it expects to be refreshed by the system indicating normal operation. If a refresh occurs within the watch period, the period is refreshed to the start. If the refresh does not occur then the watch period expires, and a signal is raised and a second watch period is begun. The initial signal is typically wired to an interrupt and alerts the system. The system can attempt to take corrective action that includes refreshing the watchdog within the second watch period. If the refresh is successful, the system returns to the previous normal operation. If it fails, then the second watch period expires and a second signal is generated. The signal is fed to a higher agent as an interrupt or reset for it to take executive action.
Greg
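The quoted description reads naturally as a small state machine, sketched below as a transition table. The state and event names are made up for this example, and treating the state after the second signal as terminal is an assumption, since the quoted text leaves the higher agent's action open.

```python
from enum import Enum


class State(Enum):
    NORMAL = "first watch period running"
    WARNED = "second watch period running"
    EXPIRED = "second signal raised"


class Event(Enum):
    REFRESH = "refresh by the system"
    PERIOD_EXPIRED = "watch period expired"


# (current state, event) -> (next state, signal raised or None)
TRANSITIONS = {
    (State.NORMAL, Event.REFRESH): (State.NORMAL, None),
    (State.NORMAL, Event.PERIOD_EXPIRED): (State.WARNED, "first signal"),
    (State.WARNED, Event.REFRESH): (State.NORMAL, None),
    (State.WARNED, Event.PERIOD_EXPIRED): (State.EXPIRED, "second signal"),
}


def step(state, event):
    """Apply one event; unknown pairs (e.g. in EXPIRED) leave the state as is."""
    return TRANSITIONS.get((state, event), (state, None))


def run(events, state=State.NORMAL):
    """Feed a sequence of events and collect the signals that were raised."""
    signals = []
    for event in events:
        state, signal = step(state, event)
        if signal is not None:
            signals.append(signal)
    return state, signals


# Corrective action succeeds: refresh arrives within the second watch period.
recovered = run([Event.PERIOD_EXPIRED, Event.REFRESH])

# Corrective action fails: the second watch period also expires.
escalated = run([Event.PERIOD_EXPIRED, Event.PERIOD_EXPIRED])
```

Where the first and second signals are routed (interrupt, reset, or some other agent) is deliberately left out of the table, in line with the flexibility described in this message.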
|
|

Allen Baum
That's a bit looser a definition than I'd expect, but that explains your comments, certainly. Thx.
On Wed, Mar 2, 2022 at 5:14 PM Greg Favor <gfavor@...> wrote:
>> Don't they even define whether restartability is required or not?
> Since the suitable response to a first or second stage timeout is rather system-specific, ARM didn't try to ordain exactly where the timeout signals go and what happens as a result. In SBSA they just described the general expected possibilities (which my previous remarks were based on). But here's what a 2020 version of BSA says (which is roughly similar to SBSA but a bit narrower in the possibilities it describes):
> The basic function of the Generic Watchdog is to count for a fixed period of time, during which it expects to be refreshed by the system indicating normal operation. If a refresh occurs within the watch period, the period is refreshed to the start. If the refresh does not occur then the watch period expires, and a signal is raised and a second watch period is begun. The initial signal is typically wired to an interrupt and alerts the system. The system can attempt to take corrective action that includes refreshing the watchdog within the second watch period. If the refresh is successful, the system returns to the previous normal operation. If it fails, then the second watch period expires and a second signal is generated. The signal is fed to a higher agent as an interrupt or reset for it to take executive action.
> Greg
|
|