Re: Watchdog timer per hart?

Allen Baum

Now we're starting to drill down appropriately. There is a wide range of options.
This is me thinking out loud and trying desperately to avoid the real work I should be doing:

 - A watchdog timer event can cause an interrupt (as opposed to a HW reset)
  -- maskable or non-maskable?
  -- using xTVEC to vector, or a platform-defined vector (e.g. the reset vector)?
  -- a new cause type, or reuse an existing one (e.g. the reset cause)?
  -- restartable, non-restartable, or both? (Both implies, to me at least, the 2-stage watchdog concept, "pulling the emergency cord".)
      If the watchdog timer is restartable, it must either
        --- be maskable, or
        --- implement something like the restartable-NMI spec to be able to save state.
  -- what does "pulling the emergency cord" do? e.g.
      --- some kind of HW reset (we had a light reset at Intel that cleared as little as possible so that a post-mortem dump could identify what was going on)
      --- just vector to a SW handler (obviously this should depend on why the watchdog timer was activated, e.g. waiting for a HW event or SW event)

On Wed, Mar 2, 2022 at 12:41 PM Kumar Sankaran <ksankaran@...> wrote:
From a platform standpoint, the intent was to have a single platform
level watchdog that is shared across the entire platform. This
platform watchdog could be the 2-level watchdog as described below by
Greg. Whether S-mode software or M-mode software would handle the
tickling of this watchdog and handle timeouts is a subject for further

On Wed, Mar 2, 2022 at 12:34 PM Greg Favor <gfavor@...> wrote:
> On Wed, Mar 2, 2022 at 12:23 PM Aaron Durbin <adurbin@...> wrote:
>> Yes. Greg articulated what I was getting at better than I did. I apologize for muddying the waters. From a platform standpoint one system-level watchdog should suffice as it's typically the last resort of restarting a system prior to sending a tech out.
> One comment - for when any concrete discussion about having a system-level watchdog occurs:
> One can have a one-stage or a two-stage watchdog.  The former yanks the emergency cord on the system upon timeout.
> The latter (which is what ARM defined in SBSA and the subsequent BSA) interrupts the OS on the first timeout and gives it a chance to take remedial actions (and refresh the watchdog).  Then, if a second timeout occurs (without a refresh after the first timeout), the emergency cord is yanked.
> ARM also defined separate Secure and Non-Secure watchdogs (akin to what one might call S-mode and M-mode watchdogs).  The OS has its own watchdog to tickle and an emergency situation results in reboot of the OS (for example).  And the Secure Monitor has its own watchdog and an emergency situation results in reboot of the system (for example).
> Greg
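
For reference, the two-stage behavior Greg describes can be sketched as a small software model. This is only an illustration under stated assumptions: the names (`watchdog`, `wd_init`, `wd_refresh`, `wd_tick`) are hypothetical, both stages are assumed to use the same timeout, and nothing here reflects the actual SBSA register interface or any RISC-V platform definition.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a two-stage watchdog; not a real register interface. */
typedef enum { WD_ARMED, WD_FIRST_EXPIRED, WD_RESET_ASSERTED } wd_stage;
typedef enum { WD_EV_NONE, WD_EV_INTERRUPT, WD_EV_RESET } wd_event;

typedef struct {
    uint32_t timeout;    /* ticks per stage (assumed equal for both stages) */
    uint32_t remaining;  /* ticks left in the current stage */
    wd_stage stage;
} watchdog;

/* Arm the watchdog; a zero timeout is rejected as meaningless. */
void wd_init(watchdog *w, uint32_t timeout)
{
    assert(timeout > 0);
    w->timeout = timeout;
    w->remaining = timeout;
    w->stage = WD_ARMED;
}

/* "Tickle" the watchdog: restart the countdown and return to the first
 * stage. Once the emergency cord has been pulled, a refresh has no effect. */
void wd_refresh(watchdog *w)
{
    if (w->stage == WD_RESET_ASSERTED)
        return;
    w->remaining = w->timeout;
    w->stage = WD_ARMED;
}

/* Advance time by one tick and report what the watchdog signals:
 * first expiry  -> interrupt (software gets a chance to recover and refresh),
 * second expiry -> reset (the emergency cord). */
wd_event wd_tick(watchdog *w)
{
    if (w->stage == WD_RESET_ASSERTED)
        return WD_EV_NONE;
    if (--w->remaining > 0)
        return WD_EV_NONE;

    w->remaining = w->timeout;
    if (w->stage == WD_ARMED) {
        w->stage = WD_FIRST_EXPIRED;
        return WD_EV_INTERRUPT;
    }
    w->stage = WD_RESET_ASSERTED;
    return WD_EV_RESET;
}
```

A one-stage watchdog is the degenerate case where the first expiry goes straight to `WD_EV_RESET`. Whether the first-stage event is delivered as a maskable interrupt, an NMI, or something else is exactly the open question raised at the top of this message; the model deliberately leaves that out.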

