Re: Watchdog timer per hart?


Allen Baum
 

That's a bit looser a definition than I'd expect, but that explains your comments, certainly. Thx.

On Wed, Mar 2, 2022 at 5:14 PM Greg Favor <gfavor@...> wrote:
On Wed, Mar 2, 2022 at 4:54 PM Allen Baum <allen.baum@...> wrote:
Don't they even define whether restartability is required or not?

Since the suitable response to a first or second stage timeout is rather system-specific, ARM didn't try to ordain exactly where the timeout signals go and what happens as a result.  In SBSA they just described the general expected possibilities (which my previous remarks were based on).  But here's what a 2020 version of BSA says (which is roughly similar to SBSA but a bit narrower in the possibilities it describes):

The basic function of the Generic Watchdog is to count for a fixed period of time, during which it expects to be
refreshed by the system indicating normal operation. If a refresh occurs within the watch period, the period is
refreshed to the start. If the refresh does not occur then the watch period expires, and a signal is raised and a
second watch period is begun.

The initial signal is typically wired to an interrupt and alerts the system. The system can attempt to take
corrective action that includes refreshing the watchdog within the second watch period. If the refresh is
successful, the system returns to the previous normal operation. If it fails, then the second watch period
expires and a second signal is generated. The signal is fed to a higher agent as an interrupt or reset for it to
take executive action.
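
The two-stage behavior in the quoted text can be sketched as a small behavioral model. This is only an illustration, not the BSA register interface: the class name, the tick-based `period`, and the choice that a refresh is ignored once the second period has expired are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Signal(Enum):
    FIRST_TIMEOUT = "first"    # typically wired to an interrupt
    SECOND_TIMEOUT = "second"  # fed to a higher agent (interrupt or reset)


@dataclass
class TwoStageWatchdog:
    """Hypothetical model of a two-stage watchdog; names are illustrative."""
    period: int                 # length of one watch period, in ticks (assumed unit)
    count: int = 0              # ticks elapsed in the current watch period
    stage: int = 0              # 0 = first period, 1 = second period, 2 = expired
    signals: list = field(default_factory=list)

    def refresh(self) -> None:
        # A refresh restarts the watch period and returns to normal operation.
        # Assumption: once the second period has expired, refresh has no effect.
        if self.stage < 2:
            self.count = 0
            self.stage = 0

    def tick(self) -> None:
        if self.stage == 2:
            return
        self.count += 1
        if self.count >= self.period:
            self.count = 0
            if self.stage == 0:
                # First expiry: raise the first signal, begin the second period.
                self.signals.append(Signal.FIRST_TIMEOUT)
                self.stage = 1
            else:
                # Second expiry without a refresh: raise the second signal.
                self.signals.append(Signal.SECOND_TIMEOUT)
                self.stage = 2


wd = TwoStageWatchdog(period=3)
for _ in range(3):
    wd.tick()          # no refresh: first signal is raised
wd.refresh()           # corrective action succeeded, back to normal operation
for _ in range(6):
    wd.tick()          # no refresh across two periods: both signals are raised
print([s.name for s in wd.signals])
```

A one-stage watchdog corresponds to treating the first expiry as final; where each signal is routed is left open, as in the quoted text.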

Greg
 

On Wed, Mar 2, 2022 at 4:00 PM Greg Favor <gfavor@...> wrote:
Even ARM SBSA allowed a lot of flexibility as to where the first-stage and second-stage timeout "signals" went (which ultimately then placed the handling in the hands of software somewhere).  In other words, SBSA didn't prescribe the details of the overall watchdog handling picture.

Greg

On Wed, Mar 2, 2022 at 2:35 PM Allen Baum <allen.baum@...> wrote:
Now we're starting to drill down appropriately. There is a wide range.
This is me thinking out loud and trying desperately to avoid the real work I should be doing:

 - A watchdog timer event can cause an interrupt (as opposed to a HW reset)
   -- maskable or non-maskable?
   -- using xTVEC to vector, or a platform-defined vector (e.g. the reset vector)?
   -- a new cause type, or reuse an existing one (e.g. the reset cause)?
   -- restartable, non-restartable, or both? (both implies, to me at least, the 2-stage watchdog concept, "pulling the emergency cord")
       If the watchdog timer is restartable, it must either
         --- be maskable, or
         --- implement something like the restartable-NMI spec to be able to save state.
   -- what does "pulling the emergency cord" do? e.g.
       --- some kind of HW reset (we had a light reset at Intel that cleared as little as possible so that a post-mortem dump could identify what was going on)
       --- just vector to a SW handler (obviously this should depend on why the watchdog timer was activated, e.g. waiting for a HW event or SW event)
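
The restartability constraint in the list above can be written down as a small consistency check. This is a rough sketch only: the class and field names are hypothetical, and `state_saving_nmi` stands in for "something like the restartable-NMI spec" without modeling any actual extension.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WatchdogInterruptOptions:
    """Hypothetical bundle of the design choices listed above."""
    maskable: bool           # can the watchdog interrupt be masked?
    restartable: bool        # can execution resume after the handler returns?
    state_saving_nmi: bool   # NMI mechanism that can save interrupted state (assumed)

    def is_consistent(self) -> bool:
        # A restartable watchdog interrupt must either be maskable or rely on
        # an NMI mechanism that can save state; non-restartable is always fine.
        return (not self.restartable) or self.maskable or self.state_saving_nmi


# Restartable but neither maskable nor state-saving: not a workable combination.
bad = WatchdogInterruptOptions(maskable=False, restartable=True, state_saving_nmi=False)
ok = WatchdogInterruptOptions(maskable=False, restartable=True, state_saving_nmi=True)
print(bad.is_consistent(), ok.is_consistent())
```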


On Wed, Mar 2, 2022 at 12:41 PM Kumar Sankaran <ksankaran@...> wrote:
From a platform standpoint, the intent was to have a single platform
level watchdog that is shared across the entire platform. This
platform watchdog could be the 2-level watchdog as described below by
Greg. Whether S-mode software or M-mode software would handle the
tickling of this watchdog and handle timeouts is a subject for further
discussion.

On Wed, Mar 2, 2022 at 12:34 PM Greg Favor <gfavor@...> wrote:
>
> On Wed, Mar 2, 2022 at 12:23 PM Aaron Durbin <adurbin@...> wrote:
>>
>> Yes. Greg articulated what I was getting at better than I did. I apologize for muddying the waters. From a platform standpoint one system-level watchdog should suffice as it's typically the last resort of restarting a system prior to sending a tech out.
>
>
> One comment - for when any concrete discussion about having a system-level watchdog occurs:
>
> One can have a one-stage or a two-stage watchdog.  The former yanks the emergency cord on the system upon timeout.
>
> The latter (which is what ARM defined in SBSA and the subsequent BSA) interrupts the OS on the first timeout and gives it a chance to take remedial actions (and refresh the watchdog).  Then, if a second timeout occurs (without a refresh after the first timeout), the emergency cord is yanked.
>
> ARM also defined separate Secure and Non-Secure watchdogs (akin to what one might call S-mode and M-mode watchdogs).  The OS has its own watchdog to tickle and an emergency situation results in reboot of the OS (for example).  And the Secure Monitor has its own watchdog and an emergency situation results in reboot of the system (for example).
>
> Greg
>
>



--
Regards
Kumar




