Architecture Review Committee meeting minutes, Jan 10

John Hauser
 

Pointer-masking extension

- On Monday, members of the ARC met with the J Extension Task Group to
review the status of the proposed pointer-masking extension.

- It is the sense of the ARC that the current draft strives to solve
multiple separate problems (notably "address sanitizing" for one, and
sandboxing for another) with a small combination of features that all
happen to involve modifying memory addresses. The ARC believes the
proposal would benefit from a clearer enumeration of all the intended
uses, and for each, a review of the possible new hardware that could
help, considered separately from the other use cases. The results of
this exercise might suggest a splitting of the current draft into two
or more smaller extensions, targeting different subsets of use cases
with different hardware. The path to completion and ratification
may also be easier for an initial extension with basic functionality
that supports straightforward use cases (such as HWASAN).

- Regarding claims that pointer-masking may be useful within OS code,
a concern was expressed that masking off the upper bits of pointers
could interfere with the typical convention that a virtual address's
most-significant bit (bit 63 or 31) reflects the privilege for the
addressed memory. Other architectures have encountered similar
issues, and an appropriate solution needs to be found for RISC-V.

- Along similar lines, the ARC was uncertain whether there is enough
experience with some of the proposed use cases when in untranslated
modes of operation (M-mode and Bare translation mode). An initial
extension should probably concentrate on the most well-established
use cases in translated modes of operation. A follow-on extension
can then tackle the specifics for operation in untranslated modes.

- Other feedback will be given to the J Extension TG.

Vector crypto extension

- The proposed draft is believed to still need work to meet RVIA
standards for correctness and clarity. Specific further feedback
is being provided.

This week's committee chairs meeting was recapped and discussed.

There was also a general discussion about steps ARC members need to
take in order for some outstanding fast-track extensions to progress,
such as Zfa, Zimops, and Smrnmi.


Re: unsupported counters

Andrew Waterman
 

The implementation has some latitude here, but implementations I’m aware of do hardwire all these bits to zero in that case.

If the corresponding xcounteren bits are read-only zero, then unprivileged accesses to the counter will trap. If that’s acceptable, then go for it. If you want unprivileged software to be able to access the counter (which admittedly isn’t terribly useful if it’s hardwired to zero), then don’t hardwire the xcounteren bit to zero.

I would recommend making the mcountinhibit bit read-only, since making it writable serves no purpose if the counter is read-only zero. But this also isn't a strict requirement.
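
As an illustrative sketch (not a requirement), M-mode firmware might probe for a hardwired counter and hide it from lower privilege levels along these lines; csr_write/csr_read/csr_clear are assumed Linux-style helper macros:

static void hide_unimplemented_hpm31(void)
{
        /* If mhpmcounter31 is hardwired to zero, writes don't stick. */
        csr_write(mhpmcounter31, -1UL);
        if (csr_read(mhpmcounter31) == 0)
                /* Read-only zero: clear the mcounteren bit so S/U-mode
                 * reads of hpmcounter31 raise illegal-instruction. */
                csr_clear(mcounteren, 1UL << 31);
}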



unsupported counters

Beeman Strong
 

Hi there,

For an implementation that does not support all 29 programmable HPM counters, what is the expected behavior of bits in mcountinhibit and xcounteren associated with the unsupported counters?  If hpmevent31 and hpmcounter31 are read-only 0, should mcountinhibit[31] and xcounteren[31] also be read-only 0?

thanks,
beeman


Re: Is behavior for out-of-range physical addresses explicitly specified?

Paul Donahue
 

In this example, an address above 2^56 would access a vacant PMA region.  Regardless of the outcome of the PMP check, accessing a vacant PMA region will cause an access fault.

I think that you have a very old spec because the upper bits of pmpaddr changed from WIRI to WARL back in version 1.11 and the only legal value is 0.  The whole concept of WIRI hasn't existed for years.
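
As a rough C model of that behavior (illustrative only, not normative text), for an example machine with 56 implemented physical-address bits:

#include <stdbool.h>
#include <stdint.h>

#define PA_BITS 56  /* implemented physical-address bits (example) */

/* WARL: pmpaddr holds physical address bits [55:2], so CSR bits
 * above bit 53 read back as zero no matter what is written. */
static uint64_t pmpaddr_write(uint64_t value)
{
        return value & ((UINT64_C(1) << (PA_BITS - 2)) - 1);
}

/* An access above 2^56 targets a vacant PMA region, so it raises an
 * access fault regardless of the outcome of the PMP check. */
static bool access_faults(uint64_t paddr)
{
        return (paddr >> PA_BITS) != 0;
}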


Thanks,

-Paul




Re: Is behavior for out-of-range physical addresses explicitly specified?

Abel Bernabeu
 

With that clarification in place, the comparison of an incoming address against a range happens in the 64-bit space.



Re: Is behavior for out-of-range physical addresses explicitly specified?

Abel Bernabeu
 

The pmpaddr CSRs have WIRI bits at the top, suggesting that bits [63:56] are to be discarded.

And when discarding bits, one should assume some value for them... but there is no mention of what value to assume.

A note explicitly saying to assume zero would seem like a sensible thing to have.

Regards.




Is behavior for out-of-range physical addresses explicitly specified?

kenney@...
 

Does the RISC-V architecture require particular behavior when physical addresses outside the implemented range are used?

Suppose for example that 56 bits of physical memory are implemented. Is an access with non-zero bits in the range [63:56] required to trap, or is it permitted to discard these bits prior to any PMP checks, effectively wrapping the address range?

(I'm thinking about a system with M+U modes only here.)

Thanks,

James.


Re: Cache Block Operations by Index #github #riscv #cmo

Phil McCoy
 

Hi Krishna,

I'm interested in that and any other unfinished work from the CMO TG, but I don't have enough time available to lead/drive that effort (and won't for at least the next few months).  I would be happy to contribute in a smaller capacity if someone else is interested in and available to lead the effort.

Cheers,
Phil

P.S.  Your name sounds familiar; are you the Krishna Nagar who worked at MIPS a few years back?


Re: Cache Block Operations by Index #github #riscv #cmo

Greg Favor
 

The CMO TG completed and ratified what was referred to at the time as phase 1 work.  Per RVI procedures, that TG was wrapped up, and any further CMO efforts are expected to be pursued via a new TG chartered to focus specifically on a chosen goal or related set of goals.  Some email discussion occurred about possible next topics for a follow-on "phase 2" TG, but no interested parties stepped forward to create a new TG for any of those topics.  So the door remains open for that to happen.

Greg




Cache Block Operations by Index #github #riscv #cmo

krishna.nagar@...
 

The cache maintenance operations roadmap suggests there are plans to add Cache Block Operations by Index. Has there been any decision on whether these will be added to the spec? If so, what will be the instruction encodings of these instructions?

riscv-CMOs/CMO-Phase-1-Scope.md at master · riscv/riscv-CMOs · GitHub


Architecture Review Committee meeting minutes, Jan 3

John Hauser
 

Pointer-masking extension

- The ARC proposed a meeting with the J Extension Task Group to discuss
some specifics of the latest draft proposal.

IOMMU

- Discussed the ongoing rework of the IOMMU draft specification's
labeling of stages of address translation. The ARC expressed
a preference for renaming IOMMU "S-stage" and "VS-stage"
to "first-stage"/"stage 1", and renaming IOMMU "G-stage" to
"second-stage"/"stage 2". The committee also believes the IOMMU
document will need commentary about mapping a hart's address
translation stages (single-stage, VS-stage, and G-stage) to the
first- and second-stage translations supported by an IOMMU.

RAS exception codes and interrupt numbers

- Reviewed earlier debates about the need for one or more distinct
exception codes to indicate "corrupt data" when an explicit or
implicit memory read does not complete successfully due to an
uncorrected memory error. Re-affirmed earlier conclusions that:
(a) corrupt data errors should not be reported as access faults, and
(b) a single "corrupt data" exception code appears to be adequate,
because systems are expected to provide any additional information
about memory errors through a separate "RAS" interface.

Consequently, the ARC is inviting a proposal for a fast-track
extension---probably named Ssecorrupt---that defines exception
code 32 to mean "corrupt data" and specifies the circumstances under
which a synchronous exception trap with this code will be taken.
(For RV32, a new high-half CSR, medelegh, is also needed, although
an implementation may make it simply read-only zero.)

- Resumed a debate about the justification for a possible fast-track
extension to extend the AIA (Advanced Interrupt Architecture) by
defining two local interrupt numbers to use for RAS interrupts, and
nothing else. No firm conclusion reached on this topic.


Architecture Review Committee meeting minutes, 12/20

Andrew Waterman
 

Procedural

- ARC plans to write a spec-writing guide in early 2023.
- No ARC meeting next week.


PLIC

- Discussed PLIC public review feedback. Removing references to H mode is
correct. Furthermore, references to U-mode interrupts should be purged,
since there is no ratified U-mode interrupt support. We recommend adding
a non-normative note explaining that the PLIC spec _could_ be generalized
to support U-mode interrupts at a later date, to obviate questions on
the subject.


Vector crypto

- Starting our final review, with aim to complete in early January.


Pointer masking

- Discussed feedback recently provided directly to the TG about the most
recent spec revision. Waiting for TG's response.


Profiles

- We brainstormed about what features should drive the next major
profile release. No firm conclusions yet.


Re: Proposal for "Multiple LR/SC forward progress guarantee levels."

striker@...
 

Guo, 

 

You're welcome for the presentation. No need to apologize for the delay in replying; this change isn't going to happen in the architecture in any event, so there's no specific hurry here.

 

I don’t say that to be rude or dismissive, but I do not want to give the false impression, by replying to these emails, that I think this proposal has any legitimacy or should ever happen in the architecture… I don’t and it shouldn’t.   


However, I am happy to discuss these issues, though there are limits to how much time I have to spend on this conversation.

 

I’m gratified that the “absolute” guarantee level has been removed. That was never going to be possible, so that’s progress. Thank you. Now, if we can just get past the rest of it….  

 

Your email below throws a great many things at the wall. I do not have time now to go through all that point by point. Instead, I’d like us to focus on one thing: 

 

What changes do you think you want in the architecture, and how will software use those new features? (For now, how you'll try to justify them later is not important; please just explain exactly WHAT you want and how you plan for software to use the new features.)

 

Given your last reply to John, it appears that you now want a number of ill-defined, microarchitecture-specific return codes on SC to say "why" the SC failed (we'll ignore for the moment that this is not a good way to specify big-A architecture, that it is painful to implement and verify, and that differing implementations are not at all likely to be uniform on what constitutes each return code).


More importantly, having those codes be returned by SC does nothing to provide a guarantee of forward progress. The best that can do is give you some indication after the fact (that may or may not be accurate) as to why the SCs are failing. 


So, are you now NOT asking for a forward progress guarantee? That would be progress, because generally speaking such a guarantee can't reasonably be built anyway.


It is imperative that you get very clear and very specific on what you want and how software will use and benefit from these features. We will make no progress here (no pun intended) until that happens. 


Derek Williams




From: tech-privileged@... <tech-privileged@...> on behalf of Guo Ren <guoren@...>
Sent: Sunday, December 18, 2022 8:38 AM
To: tech-privileged@... <tech-privileged@...>
Subject: [EXTERNAL] Re: [RISC-V] [tech-privileged] Proposal for "Multiple LR/SC forward progress guarantee levels."
 



Re: Proposal for "Multiple LR/SC forward progress guarantee levels."

Guo Ren
 

We can combine it with the PMU to observe the system's forward-progress behavior. For example, if too many "1" failure codes appear, cache contention is serious; if too many "2" failure codes appear, there may be too many exceptions or interrupts. We may even introduce more failure codes in the future, e.g., "3" to indicate contention between SMT harts.

In the current ISA spec, separating LR/SC from AMO is unsuitable; it prevents vendors from implementing balanced atomic primitives (Power is balanced: if a vendor wants weak LR/SC, AMO also needs to be weak). We could describe the guarantee in terms of the reason an SC fails and leave more freedom to the microarchitecture. For now, I hope this helps define forward progress guarantee levels with litmus tests.
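
For illustration, a kernel could dedicate two HPM counters to the proposed codes (the event encodings below are hypothetical and would be implementation-defined, as would the failure codes themselves; csr_write is an assumed helper):

/* Hypothetical, implementation-defined event encodings. */
#define EVT_SC_FAIL_COHERENCE  0x101   /* proposed failure code 1 */
#define EVT_SC_FAIL_LOCAL      0x102   /* proposed failure code 2 */

csr_write(mhpmevent3, EVT_SC_FAIL_COHERENCE);
csr_write(mhpmevent4, EVT_SC_FAIL_LOCAL);
/* A high mhpmcounter3-to-mhpmcounter4 ratio would then point at cache
 * contention rather than interrupt/trap pressure. */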


Re: Proposal for "Multiple LR/SC forward progress guarantee levels."

John Ingalls
 

Guo --

Thanks for the thoughtful reply.

How would the system or software use the proposed differing return codes?

-- John




Re: Proposal for "Multiple LR/SC forward progress guarantee levels."

Guo Ren
 
Edited

Derek,

Thanks for sharing the presentation with us; I spent some time chewing on it, so this reply is late.


--- The problems of qspinlock ---

The Linux queued spinlock uses the lock_pending and mcs_spinlock mechanisms to shrink struct qspinlock to 4 bytes (Linux's size requirement for the generic spinlock). The current version of the queued spinlock splits struct qspinlock into two parts: locked_pending and tail (for mcs_spinlock). In terms of software design they are treated as isolated half-words, which also means the author of qspinlock makes no promises about cases where lock_pending affects xchg_tail (I have updated [1] pages 6 & 7 to explain this problem in detail).

This is a challenge for RISC-V and IBM Power, which don't have sub-word atomic instructions. (I also see that x86 uses BINARY_RMW (bit atomics) instead of AMOOR.W to optimize the queued_fetch_set_pending_acquire() function, which disturbs me deeply; note that the author of qspinlock comes from the x86 world.)

 

--- Power's experience can't dispel my concerns ---
Firstly, RISC-V has AMO instructions, while Power is a pure LR/SC architecture; on this point the two are fundamentally different. The current RISC-V spec states that there is no forward progress guarantee between LR/SC and "AMO or store" (refer to [1] page 4). This implies that the RISC-V AMO and LR/SC implementations are separate and unequal, which leads to a significant asymmetry in the riscv-Linux atomic RMW & CAS primitives. IBM Power is based on pure LR/SC instructions, so at least the power-Linux atomic RMW & CAS primitives are balanced.

Secondly, the cloud-server scenario of Power's 240-core, 1920-thread machines cannot cover all qspinlock situations. Because qspinlock has four working states (uncontended, pending, uncontended queue, contended queue), it enters the contended-queue state when there are too many contending harts. By that point the lock_pending contention has ended and qspinlock sits stably in the xchg_tail contended-queue state, so that isn't the target situation for this topic.

The frequent switching between the uncontended, pending, and uncontended-queue states is the problem. For example, in an embedded SoC scenario, two sockets (2 cores/socket, non-SMT) are interconnected through PCI-E CCIX to form a 2-NUMA-node SMP system (the NUMA interconnect performance is not good, and there is significant latency). A strong guarantee level can help a CPU hold the cacheline during qspinlock's xchg_tail, avoiding the NUMA cacheline dancing overhead caused by retries. Because the core speed is much higher than the NUMA interconnect, locking the cacheline internally for some cycles is acceptable.

Finally, I would like to ask a question: there are no atomic transactions in IBM Power OpenAXON's SMP interconnect (similar to the Atomic Transactions in AMBA CHI: Atomic, AtomicFetch, AtomicSwap, AtomicCompare), right?

--- About "add new SC failure code" ---

Due to the conflict between the "Absolute Guarantee" and "LR generates a load page-fault exception" in the spec, I gave up the absolute level (updated [1] pages 12, 13). Now my proposal only wants to define the Strong level explicitly in the spec. This is not to pull RISC-V back to a strong guarantee but to define both Weak and Strong in the ISA spec, just as we've done for RVWMO and RVTSO. The spec already hints at a Strong Guarantee (see [1] pages 10, 11), and [1] page 9 points out that the current Strong Guarantee has had hardware implementations. So, why can't we define a Strong Guarantee?

Compared with a textual description, using litmus tests to define the strong and weak levels may be more transparent and easier to understand. Hence, the proposal introduces a new SC failure code and three hand-written litmus cases (STRONG_LR-SC, WEAK_LR-SC, CAS_LR-LR-SC).


The problem with RISC-V's guarantee is that it splits LR/SC and AMO (which is not IBM Power-friendly). The specification should require that LR/SC and AMO maintain the same guarantee level. If LR/SC has only a weak guarantee, the CPU vendor should decompose AMO into LR+SC+BNEZ instead of implementing a guaranteed AMO.

Finally, I suggest clearly defining the Weak and Strong levels in the ISA spec, with AMO and LR/SC at the same guarantee level. Software should be designed against the weak guarantee for better compatibility and performance (similar to RVWMO & RVTSO).

--- About cmpxchg forward progress guarantee ---
I described on [1] pages 15 and 16 how to implement the CMPXCHG guarantee. In CAS_LR-LR-SC, an LR/SC sequence may contain multiple LR instructions to the same address. During lrscCycles - Backoffs, if another LR instruction to the same address is encountered, it is treated as a regular load: the cacheline stays in the exclusive/modified state, and the sequence does not enter the backoff process until it encounters an SC or the cycle budget runs out.
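
As a sketch of those proposed semantics (hypothetical, not current RISC-V; modeled on the kernel's __cmpxchg_relaxed):

static inline unsigned int cmpxchg_double_lr(unsigned int *ptr,
                                             unsigned int old,
                                             unsigned int new)
{
        unsigned int ret, rc;

        __asm__ __volatile__ (
                /* The first lr.w opens the guarantee window. */
                "0:     lr.w   %0, %2\n"
                "       bne    %0, %z3, 1f\n"
                /* A second lr.w to the same address would, under the
                 * proposed rule, act as a regular load: the cacheline
                 * stays exclusive/modified until sc.w or cycle-out. */
                "       lr.w   %0, %2\n"
                "       sc.w   %1, %z4, %2\n"
                "       bnez   %1, 0b\n"
                "1:\n"
                : "=&r" (ret), "=&r" (rc), "+A" (*ptr)
                : "rJ" ((long)old), "rJ" (new)
                : "memory");
        return ret;
}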

--- About transactional memory requirement ---
I agree that atomics with no/weak guarantees could gain performance benefits; could we leave that to the T-extension? (The T-extension still needs traditional AMO to implement the slow path.)


[1] https://docs.google.com/presentation/d/1UudBcj4cL_cjJexMpZNF9ppRzYxeYqsdBotIvU7sO2Q/edit?usp=sharing

Best Regards

Guo Ren


Re: Proposal for "Multiple LR/SC forward progress guarantee levels."

Phil McCoy
 

For the record, plenty of big-iron SGI machines also survived the MIPS ISA's LL/SC without stronger forward progress guarantees.

The following observation is apparent from Derek's presentation, but I think it is worth calling out here for those who didn't make it through all 73 pages :-)

Consider the case of atomically incrementing a variable with level 2 or 3 forward progress guarantees:

loop:
lr.w  x8, 0(x9)        # load-reserve the current value
addi  x8, x8, 1        # increment it
sc.w  x8, x8, 0(x9)    # store-conditional; x8 <- 0 on success
bne   x8, x0, loop     # retry if the store-conditional failed

Suppose two harts execute this code sequence at the same time, and the address in question initially contains the value zero.  In both harts, the lr.w will read zero, which then gets incremented, so that the sc.w attempts to store the value 1 to memory.  Both harts are REQUIRED to successfully store the value 1, but the correct result after both harts have executed the loop is 2.

As Derek points out in his presentation, attempts to fix this problem by somehow pre-ordaining a winner break down in cases where the code sequence ends up not attempting to execute the sc.w instruction (which is a perfectly legal and valid thing for software to do).  Similarly, attempts to hide the existence of a loser by rolling the machine state back to before the loop may technically honor the forward progress guarantee (by never seeming to execute the atomic sequence), but actual forward progress is still not guaranteed, so it is just as "bad" for software as the weak guarantee.

Cheers,
Phil


Re: Question about The RISC-V Advanced Interrupt Architecture

John Hauser
 

Oscar Jupp wrote:
To whom it may concern,
This is table 7.1 of the RISC-V Advanced Interrupt Architecture spec (Document Version 1.0-RC1).

Why are vsip[n] and vsie[n] aliases of sip[n] and sie[n] when hideleg[n] = 1?

The Privileged spec says:
“When bit 10 of hideleg is zero, vsip.SEIP and vsie.SEIE are read-only zeros. Else, vsip.SEIP
and vsie.SEIE are aliases of hip.VSEIP and hie.VSEIE.
When bit 6 of hideleg is zero, vsip.STIP and vsie.STIE are read-only zeros. Else, vsip.STIP
and vsie.STIE are aliases of hip.VSTIP and hie.VSTIE.
When bit 2 of hideleg is zero, vsip.SSIP and vsie.SSIE are read-only zeros. Else, vsip.SSIP and
vsie.SSIE are aliases of hip.VSSIP and hie.VSSIE."
The caption on Table 7.1 says:

The effects of hideleg and hvien on vsip and vsie for major
interrupts 13-63.

Bits 10, 6, and 2 in vsip are major interrupts 10, 6, and 2 for
VS level, so all of them are outside the range covered by the table
(major interrupts 13-63).

- John Hauser


Re: Proposal for "Multiple LR/SC forward progress guarantee levels."

striker@...
 


Guo, 

As the guy who probably worked the hardest to get the forward progress guarantees for LR/SC significantly weakened in the RISC-V ISA, I feel I’m going to have to comment here.  

I’ll start with the premise of the presentation (pg 6 and pg 16) that some fair forward progress guarantee (however that’s defined) is somehow “required” for the “queued-spinlock” and for transactional memory.  

I do not accept, from a vague comment in the Linux kernel, that fair forward progress is absolutely required for this queued-spinlock to work. I can accept that if you build a bad implementation of LR/SC that's unbalanced and unfair you will likely have problems with this algorithm, but you'll also likely have problems with heavily contended basic locks as well. Don't do that.

I also have a personal reason to doubt any sort of hard fairness requirement. Power machines (quite large ones: fully coherent, with up to 240 cores and 1920 threads in total currently) have run this algorithm in Linux for a long time with no particular problem (I own the majority of the LL/SC logic in Power; I would have known if there were significant issues, and my kernel maintainer as of yesterday said this was no big deal for him). So, there's at least one practical proof point that this doesn't seem to demand a hard guarantee even in a very demanding environment.

So, before anything goes anywhere here, can you, or someone helping you, explain in exquisite detail what the progress problem here really is, if any. I suspect there may be some issue here but I’m not sure it rises to the level that a fully fair forward guarantee is absolutely required.  

But even if there is a forward progress problem here, I cannot even begin to see what the proposed return codes are going to do to help you? Just what are you going to do with those codes?  

Additionally, I don’t think it will be practical to implement a machine with Level 2 (Strong) or Level 3 (Absolute) guarantees (pg 10 of your presentation) at anything approaching a rational cost in any event. I include below the 70+ page presentation deck we worked up years ago to convince people to NOT do a guarantee. I haven’t re-read this in years, so I may disagree with myself in places, and it was never intended for an audience quite this large, but in general it should hold up.

Additionally, even if you use CAS instead of LR/SC, how does the queued_spinlock have a forward progress guarantee?  CAS, as far as I understand it, checks the "compare" value and won't swap if it mismatches. What's to keep one thread (especially under high contention) from always losing by having an "old" compare value from the last CAS it did that's already been overwritten by another thread's CAS? It'll fail every time and not progress, right? If you can handle the CAS failing regularly by some other software path, why can't that same path be used for the SC that fails?

So, to summarize,

1) I’m not sure you actually have a problem here. Much more detailed proof is required. 

2) Even if there is a problem, I'm not sure how these return codes help, or that we really need a fair forward progress guarantee to deal with this.

3) You’re going to have a very hard time building a machine with LR/SC guarantees meaningfully stronger than what RISC-V currently has.  

 

Derek Williams 




From: tech-privileged@... <tech-privileged@...> on behalf of Guo Ren <guoren@...>
Sent: Monday, December 12, 2022 7:00 AM
To: tech-privileged@... <tech-privileged@...>
Subject: [EXTERNAL] [RISC-V] [tech-privileged] Proposal for "Multiple LR/SC forward progress guarantee levels."
 



Proposal for "Multiple LR/SC forward progress guarantee levels."

Guo Ren
 
Edited

Abstract

A forward progress guarantee for atomics is required by many OSes when touching contended variables. The Linux queued spinlock is one such user: it solves the fairness and cache-line bouncing problems while preserving performance and size. RISC-V Linux has been trying to port qspinlock support since 2019, but the port is blocked on the issue of RISC-V's flexible LR/SC forward progress guarantee. In this presentation, we introduce the trouble that the LR/SC forward progress guarantee definition in the RISC-V ISA causes for Linux qspinlock, and we propose our solution: use klitmus to define the LR/SC & cmpxchg forward progress guarantee cases, giving clear guidelines for microarchitecture designers and software programmers.

 

Motivation

The current LR/SC forward progress guarantee is stated in the RISC-V ISA spec under "Eventual Success of Store-Conditional Instructions":
As a consequence of the eventuality guarantee, if some harts in an execution environment are executing constrained LR/SC loops, and no other harts or devices in the execution environment execute an unconditional store or AMO to that reservation set, then at least one hart will eventually exit its constrained LR/SC loop.

HART 0                   HART 1
loop:                    loop:                    # a0 = 0x4000 (harts 0 & 1)
  lr.w t0, (a0)            lr.w t0, (a0)          # load-reserve the original value
  sc.w t1, a2, (a0)        sc.w t1, a2, (a0)      # store-conditional for AMOSWAP
  bnez t1, loop            bnez t1, loop          # retry if the store-conditional failed
  move a0, t0              move a0, t0            # return the original value
  ret                      ret

 

By contrast, if other harts or devices continue to write to that reservation set, it is not guaranteed that any hart will exit its LR/SC loop (also from "Eventual Success of Store-Conditional Instructions" in the RISC-V ISA spec).

 

HART 0                   HART 1
loop:                    loop:                    # a0 = 0x4000 (harts 0 & 1)
  sw zero, (a0)            lr.w t0, (a0)
  j loop                   sc.w t1, a2, (a0)
                           bnez t1, loop
                           move a0, t0
                           ret

No guarantee here! The current LR/SC forward progress guarantee is a weak definition, but the Linux queued spinlock needs a forward progress guarantee.

 

Here is the comment written in Linux qspinlock.h:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/asm-generic/qspinlock.h

"qspinlock relies on a far greater (compared to asm-generic/spinlock.h) set of atomic operations to behave well together, please audit them carefully to ensure they all have forward progress. Many atomic operations may default to cmpxchg() loops which will not have good forward progress properties on LL/SC architectures."

 

First, the LR/SC requirement in qspinlock when NR_CPUS < 16K:
static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
{
...
        return (u32)xchg16_relaxed(&lock->tail,
                                 tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
}

 

RISC-V doesn't have a sub-word amoswap, so an LR/SC equivalent is used:
static inline ulong __xchg16_relaxed(ulong new, void *ptr)
{
        ulong ret, tmp;
        ulong shif = ((ulong)ptr & 2) ? 16 : 0;
        ulong mask = 0xffff << shif;
        ulong *__ptr = (ulong *)((ulong)ptr & ~2);

        __asm__ __volatile__ (
                "0: lr.w %0, %2\n"
                " and %1, %0, %z3\n"
                " or %1, %1, %z4\n"
                " sc.w %1, %1, %2\n"
                " bnez %1, 0b\n"
                : "=&r" (ret), "=&r" (tmp), "+A" (*__ptr)
                : "rJ" (~mask), "rJ" (new << shif)
                : "memory");
        return (ulong)((ret & mask) >> shif);
}
We need a strong forward progress guarantee level here so that xchg16 behaves like an AMO when contending with a single store instruction.

Second, the cmpxchg forward progress guarantee requirement in qspinlock when NR_CPUS >= 16K:
#define __cmpxchg_relaxed(ptr, old, new, size) \
({ \

                __asm__ __volatile__ ( \
                        "0:    lr.w  %0, %2\n" \
                        " bne %0, %z3, 1f\n" \
                        "       sc.w %1, %z4, %2\n" \
                        " bnez %1, 0b\n" \
                        "1:\n" \
                        : "=&r" (__ret), "=&r" (__rc), "+A" (*__ptr) \
                        : "rJ" ((long)__old), "rJ" (__new) \
                        : "memory"); \

Linux cmpxchg is the CAS primitive; RISC-V uses an LR/SC equivalent.

static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
{
        u32 old, new, val = atomic_read(&lock->val);
        for (;;) {
                new = (val & _Q_LOCKED_PENDING_MASK) | tail;
                old = atomic_cmpxchg_relaxed(&lock->val, val, new);
                if (old == val)
                        break;
                val = old;
        }
        return old;
}

This cmpxchg must eventually succeed and break the loop, even under contention!

 

Proposal for discussion:

In the current RISC-V spec ("A" Standard Extension for Atomic Instructions, Load-Reserved/Store-Conditional Instructions):

"The failure code with value 1 is reserved to encode an unspecified failure. Other failure codes are reserved at this time, and portable software should only assume the failure code will be non-zero. We reserve a failure code of 1 to mean “unspecified” so that simple implementations may return this value using the existing mux required for the SLT/SLTU instructions. More specific failure codes might be defined in future versions or extensions to the ISA."

 

Add a failure code of 2 to the store-conditional instruction:

Store-conditional returns a failure code:

0 - Success
1 - Failure for an unspecified reason (includes cache-coherency reasons)
2 - Failure due to a local event (interrupt/trap)

 

Then, define three forward progress guarantee levels:

Level 1 (Weak) - SC returns 0/1/2
Level 2 (Strong) - SC returns 0/2, never returns 1.
Level 3 (Absolute) - SC returns 0, never returns 1/2.
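
For illustration, a retry loop could consume these codes as follows (hypothetical: neither the new codes nor the "backoff" routine exist in the ratified ISA):

retry:
  lr.w   t0, (a0)
  addi   t0, t0, 1
  sc.w   t1, t0, (a0)
  beqz   t1, done        # 0: success
  li     t2, 2
  beq    t1, t2, retry   # 2: local interrupt/trap, retry immediately
  call   backoff         # 1: coherence contention, back off first
  j      retry
done: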

 

Litmus tests (Levels 1-3, and the cmpxchg forward progress guarantee):

Level 3 - Absolute:
RISCV LEVEL3-LR-SC
{
0:x5=x;
1:x5=x;
}
 P0                | P1          ;
 ori x6,x0,1       | ori x6,x0,2 ;
 lr.w x7,0(x5)     | sw x6,0(x5) ;
 sc.w x8,x6,0(x5)  |             ;
forall
(x=1 /\ 0:x7=2 /\ 0:x8=0) \/
(x=2 /\ 0:x7=0 /\ 0:x8=0)
This makes LR/SC behave like one atomic instruction.

 

Level 2 - Strong:

RISCV LEVEL2-LR-SC
{
0:x5=x;
1:x5=x;
}
 P0                | P1          ;
 ori x6,x0,1       | ori x6,x0,2 ;
 lr.w x7,0(x5)     | sw x6,0(x5) ;
 sc.w x8,x6,0(x5)  |             ;
forall
(x=1 /\ 0:x7=2 /\ 0:x8=0) \/
(x=2 /\ 0:x7=2 /\ 0:x8=2) \/
(x=2 /\ 0:x7=0 /\ 0:x8=0) \/
(x=2 /\ 0:x7=0 /\ 0:x8=2)
Only a local event makes the Store-Conditional fail.

 

Level 1 - Weak:
RISCV LEVEL1-LR-SC
{
0:x5=x;
1:x5=x;
}
 P0                | P1          ;
 ori x6,x0,1       | ori x6,x0,2 ;
 lr.w x7,0(x5)     | sw x6,0(x5) ;
 sc.w x8,x6,0(x5)  |             ;
forall
(x=1 /\ 0:x7=2 /\ 0:x8=0) \/
(x=2 /\ 0:x7=0 /\ 0:x8=1) \/
(x=2 /\ 0:x7=0 /\ 0:x8=2) \/
(x=2 /\ 0:x7=2 /\ 0:x8=2) \/
(x=2 /\ 0:x7=0 /\ 0:x8=0)
No forward progress guarantee.

 

cmpxchg forward progress guarantee:

RISCV LEVEL3-LR-LR-SC (Absolute Level)
{
0:x5=x;
1:x5=x;
}

 P0                | P1          ;
 ori x6,x0,1       | ori x6,x0,2 ;
 lr.w x7,0(x5)     | sw x6,0(x5) ;
 bnez x6, 1f       |             ;
 nop               |             ;
1:                 |             ;
 lr.w x8,0(x5)     |             ;
 sc.w x9,x6,0(x5)  |             ;
forall
(x=1 /\ 0:x7=2 /\ 0:x8=2 /\ 0:x9=0) \/
(x=2 /\ 0:x7=0 /\ 0:x8=0 /\ 0:x9=0)

 

 

RISCV LEVEL2-LR-LR-SC (Strong Level)
{
0:x5=x;
1:x5=x;
}

 P0                | P1          ;
 ori x6,x0,1       | ori x6,x0,2 ;
 lr.w x7,0(x5)     | sw x6,0(x5) ;
 bnez x6, 1f       |             ;
 nop               |             ;
1:                 |             ;
 lr.w x8,0(x5)     |             ;
 sc.w x9,x6,0(x5)  |             ;
forall
(x=1 /\ 0:x7=2 /\ 0:x8=2 /\ 0:x9=0) \/
(x=1 /\ 0:x7=0 /\ 0:x8=2 /\ 0:x9=0) \/
(x=2 /\ 0:x7=0 /\ 0:x8=2 /\ 0:x9=2) \/
(x=2 /\ 0:x7=2 /\ 0:x8=2 /\ 0:x9=2) \/
(x=2 /\ 0:x7=0 /\ 0:x8=0 /\ 0:x9=2) \/
(x=2 /\ 0:x7=0 /\ 0:x8=0 /\ 0:x9=0)

 

Here is the presentation for details:

https://docs.google.com/presentation/d/1UudBcj4cL_cjJexMpZNF9ppRzYxeYqsdBotIvU7sO2Q/edit?usp=sharing
