Re: Proposal for "Multiple LR/SC forward progress guarantee levels."


John Ingalls
 

Guo --

Thanks for the thoughtful reply.

How would the system or software use the proposed differing return codes?

-- John


On Sun, Dec 18, 2022, 6:40 AM Guo Ren <guoren@...> wrote:

[Edited Message Follows]

Derek,

Thanks for sharing the presentation with us. I spent some time digesting it, so this reply is late.


--- The problems of qspinlock ---

The Linux queued spinlock uses the lock_pending and mcs_spinlock mechanisms to fit struct qspinlock into 4 bytes (the size Linux requires for a generic spinlock). The current version of the queued spinlock splits struct qspinlock into two parts: locked_pending and tail (for mcs_spinlock). In terms of software design, the two are treated as isolated half-words, which also means that the author of qspinlock makes no promises about how lock_pending activity affects xchg_tail (I have updated [1] pages 6 & 7 to explain this problem in detail).
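To make the half-word split concrete, here is a minimal C sketch of how the 32-bit lock word can be viewed as a locked_pending half and a tail half. The bit positions follow my understanding of the Linux layout for small NR_CPUS and should be read as an assumption; the names (Q_*, q_pack, q_tail, q_locked_pending, q_layout_selfcheck) are hypothetical and are not the kernel's identifiers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout (illustrative names, not the kernel's macros):
 *   bits  0-7   locked byte
 *   bits  8-15  pending byte
 *   bits 16-31  tail (encodes the MCS queue tail)
 */
#define Q_LOCKED_PENDING_MASK 0x0000ffffu
#define Q_TAIL_SHIFT          16

/* Build a lock word from its three logical fields. */
static inline uint32_t q_pack(uint8_t locked, uint8_t pending, uint16_t tail)
{
    return (uint32_t)locked | ((uint32_t)pending << 8) |
           ((uint32_t)tail << Q_TAIL_SHIFT);
}

/* Lower half-word: the part that the pending/locked paths modify. */
static inline uint16_t q_locked_pending(uint32_t val)
{
    return (uint16_t)(val & Q_LOCKED_PENDING_MASK);
}

/* Upper half-word: the part that xchg_tail modifies. */
static inline uint16_t q_tail(uint32_t val)
{
    return (uint16_t)(val >> Q_TAIL_SHIFT);
}

/* Sanity check: changing the pending byte leaves the tail view untouched,
 * yet both live in the same 32-bit word (and the same cacheline). */
static inline void q_layout_selfcheck(void)
{
    uint32_t v = q_pack(1, 0, 0x0004);
    uint32_t w = q_pack(1, 1, 0x0004);

    assert(q_tail(v) == q_tail(w));
    assert(q_locked_pending(v) != q_locked_pending(w));
}
```

The two halves are logically independent, but a word-sized LR/SC or CAS on one half still observes every change to the other half.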

This is a challenge for RISC-V and IBM Power, which lack sub-word atomic instructions. I also see that x86 uses BINARY_RMW (a bit-level atomic) rather than an AMOOR.W-style word operation to optimize the queued_fetch_set_pending_acquire() function, which worries me deeply (PS: the author of qspinlock comes from x86).
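As an illustration of what the missing sub-word atomics cost, here is a rough C11 sketch of a half-word tail exchange emulated with a word-sized compare-and-swap loop. It is only a model under assumptions: the tail is assumed to sit in the upper 16 bits, and xchg_tail_cas, TAIL_MASK, and the retries counter are hypothetical names, not the kernel code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define TAIL_MASK 0xffff0000u /* assumption: tail occupies bits 16-31 */

/* Swap in new_tail (already shifted into bits 16-31) and return the
 * previous lock word. If retries is non-NULL, it counts failed attempts. */
static inline uint32_t xchg_tail_cas(_Atomic uint32_t *lock,
                                     uint32_t new_tail, unsigned *retries)
{
    uint32_t old = atomic_load_explicit(lock, memory_order_relaxed);

    assert((new_tail & ~TAIL_MASK) == 0);

    for (;;) {
        /* Keep the locked/pending half as observed, replace only the tail. */
        uint32_t val = (old & ~TAIL_MASK) | new_tail;

        /* On failure, old is reloaded with the current word. A concurrent
         * change to the pending or locked byte is enough to fail here. */
        if (atomic_compare_exchange_weak_explicit(lock, &old, val,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed))
            return old;

        if (retries != NULL)
            (*retries)++;
    }
}
```

With a native 16-bit exchange, a store to the pending byte would not force the tail update to retry. With the word-sized loop, it can, which is the interference between lock_pending and xchg_tail described above.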

 

--- Power's experience can't clear my concerns ---
Firstly, RISC-V has AMO instructions, while Power is a pure LR/SC architecture, so the two are fundamentally different on this point. The current RISC-V spec states that there is no forward-progress guarantee between LR/SC and "AMO or store" (refer to [1] page 4). This implies that the RISC-V AMO and LR/SC implementations are separate and unequal, which leads to a significant asymmetry in the riscv-Linux atomic RMW & CAS primitives. IBM Power is based purely on LR/SC instructions, so at least the power-Linux atomic RMW & CAS primitives are balanced.

Secondly, Power's cloud server scenario (240 cores, 1920 threads) cannot cover all qspinlock situations. qspinlock has four working states (uncontended, pending, uncontended queue, contended queue), and it enters the contended queue state when there are too many contending harts. At that point the lock_pending contention has ended and qspinlock is stable in the xchg_tail contended queue state, so that is not the situation this topic targets.

The problem is the frequent switching between the uncontended, pending, and uncontended queue states. For example, in an embedded SoC scenario, two sockets (2 cores/socket, non-SMT) are interconnected through PCI-E CCIX to form a 2-node NUMA SMP system (the NUMA interconnect performance of the system is not good, and there is significant latency). A strong guarantee level can help the CPU hold the cacheline during qspinlock xchg_tail, avoiding the NUMA cacheline bouncing overhead caused by retries. Because the core runs much faster than the NUMA interconnect, holding the cacheline locally for a few cycles is acceptable.

Finally, I would like to ask a question: there are no atomic transactions in IBM Power OpenAXON's SMP interconnect (similar to the Atomic Transactions in AMBA CHI: Atomic, AtomicFetch, AtomicSwap, AtomicCompare), right?

--- About "add new SC failure code" ---

Due to the conflict between "Absolute Guarantee" and "LR generates Load page-fault exception in the spec," I have dropped the Absolute level (updated [1] pages 12, 13). My proposal now only asks to define the Strong level explicitly in the spec. This is not to pull RISC-V back to a strong guarantee but to define both Weak and Strong in the ISA spec, just as we have done for RVWMO and RVTSO. The spec already hints at a Strong Guarantee (see [1] pages 10, 11), and [1] page 9 points out that the Strong Guarantee already has hardware implementations. So, why can't we define a Strong Guarantee?

Compared with a text description, using litmus tests to define the strong and weak levels may be more transparent and easier to understand. Hence, the proposal introduces a new SC failure code and three litmus HAND cases (STRONG_LR-SC, WEAK_LR-SC, CAS_LR-LR-SC).


The problem with RISC-V's guarantee is that it splits LR/SC and AMO (which is not IBM Power-friendly). The specification should require that LR/SC and AMO maintain the same guarantee level. If LR/SC has only a weak guarantee, the CPU vendor should decompose AMO into LR+SC+BNEZ instead of implementing a guaranteed AMO.
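To show the two shapes side by side, here is a small C11 sketch of the same fetch-or written once as a single atomic RMW (which a RISC-V compiler would typically lower to an AMO such as amoor.w) and once as a retry loop that models the LR+SC+BNEZ decomposition. This is only a software model: the compare-exchange loop stands in for LR/SC, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* AMO shape: one atomic read-modify-write with no software retry loop. */
static inline uint32_t fetch_or_amo(_Atomic uint32_t *p, uint32_t mask)
{
    return atomic_fetch_or_explicit(p, mask, memory_order_acquire);
}

/* LR/SC shape: load, compute, conditionally store, branch back on failure.
 * The weak compare-exchange may fail spuriously, much like an SC can. */
static inline uint32_t fetch_or_loop(_Atomic uint32_t *p, uint32_t mask)
{
    uint32_t old = atomic_load_explicit(p, memory_order_relaxed);

    while (!atomic_compare_exchange_weak_explicit(p, &old, old | mask,
                                                  memory_order_acquire,
                                                  memory_order_relaxed)) {
        /* old now holds the freshly observed value; retry with it. */
    }
    return old;
}

/* Single-threaded check that both shapes compute the same result. */
static inline void fetch_or_selfcheck(void)
{
    _Atomic uint32_t a = 0x1u;
    _Atomic uint32_t b = 0x1u;

    assert(fetch_or_amo(&a, 0x100u) == fetch_or_loop(&b, 0x100u));
    assert(atomic_load(&a) == atomic_load(&b));
}
```

Functionally the two are identical. The difference is only in forward progress: the first form completes in one operation, while the second can be forced to retry by other writers to the same word.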

Finally, I suggest clearly defining Weak and Strong levels in the ISA spec, and AMO and LR/SC should have the same guarantee level. The software should be designed according to the weak guarantee for better compatibility and performance (Similar to RVWMO & RVTSO).

--- About cmpxchg forward progress guarantee ---
On [1] pages 15 and 16 I describe how to extend the guarantee to CMPXCHG. In CAS_LR-LR-SC, an LR/SC sequence may contain multiple LR instructions to the same address. During lrscCycles - Backoffs, if the hart encounters another LR instruction to the same address, it is treated as a regular load instruction. The cacheline remains in the exclusive/modified state and does not enter the Backoff process until the hart encounters an SC or the cycles run out.

--- About transactional memory requirement ---
I agree that no-guarantee or weak-guarantee atomics could bring performance benefits. Could we leave that to the T-extension? (The T-extension still needs traditional AMO to implement the slow path.)


[1] https://docs.google.com/presentation/d/1UudBcj4cL_cjJexMpZNF9ppRzYxeYqsdBotIvU7sO2Q/edit?usp=sharing

Best Regards

Guo Ren
