Preferred manner of supporting bus errors in RISC-V


Arjan Bink
 

Hi all,

 

We want to add support for ‘bus errors’ in our RISC-V design (e.g. signaled via AXI bresp/rresp signals). I studied a couple of different RISC-V architectures and I do not see a common approach for dealing with this.

 

Some examples:

 

  • SiFive uses a ‘bus error unit’ that converts bus errors into regular interrupts
  • Ibex implements precise bus errors and raises exceptions using the RISC-V defined mcause exception codes, i.e. instruction access fault (exception code 1), load access fault (exception code 5), and store/AMO access fault (exception code 7)
  • SweRV-EL2 maps imprecise bus errors onto custom NMIs (and they also have precise bus errors).
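
For reference, a minimal sketch of how software would decode the mcause values mentioned in the list above. The helper names (`decode_mcause`, `describe`) are made up for illustration; only the Interrupt bit position (the most significant bit of mcause) and the three access-fault codes come from the privileged specification. Note that the code alone says nothing about whether a PMP check or a bus error response caused the fault.

```python
from typing import Tuple

# Access-fault exception codes named in the thread (privileged spec values).
ACCESS_FAULTS = {
    1: "instruction access fault",
    5: "load access fault",
    7: "store/AMO access fault",
}


def decode_mcause(mcause: int, xlen: int = 32) -> Tuple[bool, int]:
    """Split an mcause value into (is_interrupt, cause_code)."""
    interrupt_bit = 1 << (xlen - 1)      # The Interrupt bit is the MSB.
    is_interrupt = bool(mcause & interrupt_bit)
    code = mcause & (interrupt_bit - 1)  # Remaining bits hold the code.
    return is_interrupt, code


def describe(mcause: int, xlen: int = 32) -> str:
    """Return a short human-readable description (hypothetical helper)."""
    is_interrupt, code = decode_mcause(mcause, xlen)
    if is_interrupt:
        return f"interrupt {code}"
    return ACCESS_FAULTS.get(code, f"exception {code}")


print(describe(0x00000005))  # A load access fault, whatever its origin.
print(describe(0x80000007))  # Interrupt bit set, cause code 7.
```

With a shared code, a handler that sees "load access fault" cannot tell a PMP violation from a bus error response without extra, implementation-specific state.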

 

The RISC-V Privileged specification hardly mentions this topic, but has the following quotes that might be related:

 

“Non-maskable interrupts (NMIs) are only used for hardware error conditions”

“Precise PMA traps might not always be possible, for example, when probing a legacy bus architecture that uses access failures as part of the discovery mechanism. In this case, error responses from slave devices will be reported as imprecise bus-error interrupts.”

 

In our design we will have a PMP (so exception codes 1, 5, 7 are used to report precise PMP exceptions), precise instruction bus errors, and imprecise data bus errors. What is the intended manner of dealing with these precise instruction bus errors and imprecise data bus errors? Should we raise NMIs for them? Should we map them to a regular exception (non-interrupt) with mcause exception codes 1, 5, 7? That would be confusing, as software could then not distinguish them from the PMP errors, and codes 5 and 7 would be used for both precise PMP exceptions and imprecise data bus errors. Use of an external ‘bus error unit’ does not seem appropriate, as it could easily cause an interrupt on a speculative (and never actually executed) instruction fetch.

 

So, is there any common or recommended manner of dealing with bus errors?

 

Best regards,

Arjan


Greg Chadwick
 

Hello,

Thanks for raising this Arjan, it's been a low-priority item on my TODO list to
open a discussion on bus errors for a while now (I work on Ibex amongst other
things at lowRISC).

I think RISC-V should allow implementations to choose whether they want
precise or imprecise bus errors, which I think is the case now. However, as you
point out, the specification is pretty silent on the matter. Some wording around
what the possibilities might be, and ensuring the specification doesn't prevent
certain options from working without good reason, seems prudent.

In particular we have the issue of the mcause exception code for bus errors that
you raise. I believe codes 1, 5 and 7 are meant to be PMP faults only. Ibex is
non-conforming at the moment due to its use of the same codes for both PMP and
bus errors. I think SweRV may do the same (look at the EH1 source here:
https://github.com/chipsalliance/Cores-SweRV/blob/7332edc0adaa7e9a0c842d169154429e8d987786/design/lsu/lsu_lsc_ctl.sv#L211
when generating its exception packet it combines access and bus errors together
and only alters the type for misaligned or not). The Andes/Gowin N25 also looks to
use the PMP mcause codes for precise bus errors (see page 87 of
https://www.gowinsemi.com/upload/database_doc/586/document_ja/5de4c10ca33c9.pdf).

I don't really mind if we introduce a new code here or broaden the definition of
'access fault' to include non-PMP errors like bus errors. It could even be left
implementation-defined, though I'd prefer a specification-defined bus error
mcause.

I did also have some concerns around how precise bus errors interact with
interrupts. In particular, if you have an outstanding memory access (that may or
may not see a bus error) and receive an interrupt, is it permissible to
effectively ignore the interrupt until the potential bus error is resolved?
Again I think the specification gives implementations room to do different
things here, as it's up to the implementation how an interrupt becomes pending
(see some extensive discussion here:
https://github.com/riscv/riscv-isa-manual/issues/544). Some extra wording
somewhere to make it clear this is a possibility could be useful.

Cheers,

Greg Chadwick

On Wed, Feb 3, 2021 at 11:35 AM Arjan Bink <Arjan.Bink@...> wrote:



Josh Scheid
 

The rasd group has started [https://lists.riscv.org/g/tech-rasd/topics] and its charter, still being formulated, could help to address this. Contribute if interested.

-Josh



Phil McCoy
 

FWIW the AIA (Advanced Interrupt Architecture) reserves interrupt 30 for "bus or system errors".


Allen Baum
 

I would have thought that NMI is where you would want to put bus or system errors. Not sure I like that, unless they're pretty benign errors (e.g. they're correctable, or even correctable and overflowed some limit on the number of correctable errors)

On Wed, Dec 22, 2021 at 6:40 AM Phil McCoy <pnm@...> wrote:


Phil McCoy
 

Hi Allen,

NMI is not generally recoverable (e.g. if the NMI arrives while the hart is in the early part of a Machine-Mode trap handler, before mepc has been saved).
In some systems, Bus Errors can be used to probe what regions of the memory map are accessible.  Even excluding that case, there are use models where Bus Errors should be recovered gracefully.
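
As a rough illustration of the probing use model, here is a toy sketch in which a bus error is modelled as a Python exception that the probing code catches and recovers from. Everything in it (`ToyBus`, `BusError`, `probe`, the address ranges) is invented for the example; on real hardware the recovery would happen in a trap handler that skips the faulting access, and only if the error is reported precisely.

```python
class BusError(Exception):
    """Stands in for a precisely reported bus error in this toy model."""


class ToyBus:
    def __init__(self, mapped_ranges):
        # List of (base, size) regions that respond without error.
        self.mapped_ranges = list(mapped_ranges)

    def load(self, addr):
        for base, size in self.mapped_ranges:
            if base <= addr < base + size:
                return 0  # Dummy read data.
        raise BusError(hex(addr))  # Nothing responds at this address.


def probe(bus, candidates):
    """Return the candidate addresses that respond without a bus error."""
    present = []
    for addr in candidates:
        try:
            bus.load(addr)
        except BusError:
            continue  # Recover gracefully and try the next address.
        present.append(addr)
    return present


bus = ToyBus([(0x1000_0000, 0x1000), (0x8000_0000, 0x10_0000)])
found = probe(bus, [0x1000_0000, 0x2000_0000, 0x8000_0000, 0x9000_0000])
print([hex(a) for a in found])
```

This prints `['0x10000000', '0x80000000']`: the two unmapped addresses raise an error, and the probe loop carries on rather than treating the error as fatal.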


Allen Baum
 

NMIs are usually, but not always, irrecoverable, as you say (i.e. you can recover if you're lucky about timing or about the use case; probing falls into that category),
but there is a proposed extension to make them recoverable, precisely to enable graceful recovery (or, more likely, graceful degradation and containment of the failure).
Absent that extension, bus errors known to be recoverable are in my "benign" category, and the rest are irrecoverable, so it is kind of irrelevant whether the NMI is recoverable.
The case where an error might or might not be recoverable seems to be the tricky one.
I can't think of any in that category offhand, but that certainly doesn't mean they don't exist.
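
The timing dependence discussed above can be shown with a toy model. It assumes only what the privileged specification states for baseline NMIs, namely that taking one overwrites mepc; the `ToyHart` class, the `run` helper and all addresses are hypothetical.

```python
class ToyHart:
    """Tracks only the state needed to show the mepc hazard."""

    def __init__(self):
        self.pc = 0
        self.mepc = 0
        self.saved_mepc = None  # The handler's software copy of mepc.

    def take_trap(self, handler_pc):
        # Ordinary traps and baseline NMIs both overwrite mepc.
        self.mepc = self.pc
        self.pc = handler_pc

    def save_context(self):
        # Early handler code copies mepc to memory.
        self.saved_mepc = self.mepc


def run(nmi_before_save):
    """Return True if the originally interrupted PC is still recoverable."""
    hart = ToyHart()
    hart.pc = 0x400            # Application code that takes a trap.
    hart.take_trap(0x100)      # Enter the M-mode trap handler.
    if nmi_before_save:
        hart.take_trap(0x200)  # NMI arrives before mepc was saved.
        hart.save_context()
    else:
        hart.save_context()
        hart.take_trap(0x200)  # NMI arrives after mepc was saved.
    return hart.saved_mepc == 0x400


print(run(nmi_before_save=True))   # False: the original mepc is lost.
print(run(nmi_before_save=False))  # True: the saved copy survives.
```

In the unlucky ordering the only copy of the interrupted PC is overwritten, so the first trap can no longer be returned from.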

So, short answer: yes, it makes sense to reserve an interrupt number for those types of failures.


On Wed, Dec 22, 2021 at 12:02 PM Phil McCoy <pnm@...> wrote:


Greg Favor
 

On Wed, Dec 22, 2021 at 11:26 AM Allen Baum <allen.baum@...> wrote:
I would have thought that NMI is where you would want to put bus or system errors. Not sure I like that, unless they're pretty benign errors (e.g. they're correctable, or even correctable and overflowed some limit on the number of correctable errors)

The intent, wrt AIA assigning or setting aside interrupt 30 for bus/system errors, is not that this is the one and only way through which such errors would be reported to system software. Some systems may want to do that, others may want to use NMI, some may want to use a mixture of both, some may report these to a platform microcontroller, etc. AIA isn't trying to dictate how AIA-based systems choose to handle bus/system errors.

Greg
