Fast-track extension proposal for "Multiple LR/SC forward progress guarantee levels."


andrew@...
 

This proposal is not appropriate for the fast track, on the basis that it’s far from uncontroversial. It should be discussed at length first.

On Mon, Dec 12, 2022 at 3:09 AM Guo Ren <guoren@...> wrote:


Abstract

Many OS primitives require an atomic forward progress guarantee when touching contended variables. The Linux queued spinlock (qspinlock) is one such user: it solves the fairness and cache-line-bouncing problems of plain spinlocks while preserving performance and code size. RISC-V Linux has been trying to port qspinlock support since 2019, but the work is blocked on RISC-V's loose LR/SC forward progress guarantee. This presentation introduces the trouble that the LR/SC forward progress guarantee defined in the RISC-V ISA causes for Linux qspinlock, and proposes a solution: use klitmus to define LR/SC & cmpxchg forward progress guarantee cases, giving clear guidelines to micro-architecture designers and software programmers.

 

Motivation

The current LR/SC forward progress guarantee is written in RISC-V ISA spec - "Eventual Success of Store-Conditional Instructions":

As a consequence of the eventuality guarantee, if some harts in an execution environment are executing constrained LR/SC loops, and no other harts or devices in the execution environment execute an unconditional store or AMO to that reservation set, then at least one hart will eventually exit its constrained LR/SC loop.

 

HART 0                     HART 1                  # a0 = 0x4000 (harts 0 and 1)

loop:                      loop:
    lr.w t0, (a0)              lr.w t0, (a0)       # load-reserve the original value
    sc.w t1, a2, (a0)          sc.w t1, a2, (a0)   # store-conditional for AMOSWAP
    bnez t1, loop              bnez t1, loop       # retry if the store-conditional failed
    mv   a0, t0                mv   a0, t0         # return the original value
    ret                        ret

 

By contrast, if other harts or devices continue to write to that reservation set, it is not guaranteed that any hart will exit its LR/SC loop. (Written in RISC-V ISA spec: "Eventual Success of Store-Conditional Instructions")

 

HART 0                     HART 1                  # a0 = 0x4000 (harts 0 and 1)

loop:                      loop:
    sw zero, (a0)              lr.w t0, (a0)
    j  loop                    sc.w t1, a2, (a0)
                               bnez t1, loop
                               mv   a0, t0
                               ret

There is no guarantee here: the current LR/SC forward progress guarantee is a weak definition, but Linux queued-spinlock needs a stronger one.

 

Here is the comment written in Linux qspinlock.h:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/asm-generic/qspinlock.h

"qspinlock relies on a far greater (compared to asm-generic/spinlock.h) set of atomic operations to behave well together, please audit them carefully to ensure they all have forward progress. Many atomic operations may default to cmpxchg() loops which will not have good forward progress properties on LL/SC architectures."

 

First, the LR/SC requirement in qspinlock when NR_CPUS < 16K:

static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
{
        ...
        return (u32)xchg16_relaxed(&lock->tail,
                                   tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
}

 

RISC-V doesn’t have a sub-word amoswap, so an LR/SC equivalent is used:

static inline ulong __xchg16_relaxed(ulong new, void *ptr)
{
        ulong ret, tmp;
        ulong shift = ((ulong)ptr & 2) ? 16 : 0;
        ulong mask = 0xffff << shift;
        ulong *__ptr = (ulong *)((ulong)ptr & ~2);

        __asm__ __volatile__ (
                "0:     lr.w %0, %2\n"
                "       and  %1, %0, %z3\n"
                "       or   %1, %1, %z4\n"
                "       sc.w %1, %1, %2\n"
                "       bnez %1, 0b\n"
                : "=&r" (ret), "=&r" (tmp), "+A" (*__ptr)
                : "rJ" (~mask), "rJ" (new << shift)
                : "memory");

        return (ulong)((ret & mask) >> shift);
}

We need a strong forward progress guarantee here so that this xchg16 behaves like an AMO when contending with a single store instruction.

 

Second, the cmpxchg forward progress guarantee requirement in qspinlock when NR_CPUS >= 16K:

#define __cmpxchg_relaxed(ptr, old, new, size)                  \
({                                                              \
        ...                                                     \
        __asm__ __volatile__ (                                  \
                "0:     lr.w %0, %2\n"                          \
                "       bne  %0, %z3, 1f\n"                     \
                "       sc.w %1, %z4, %2\n"                     \
                "       bnez %1, 0b\n"                          \
                "1:\n"                                          \
                : "=&r" (__ret), "=&r" (__rc), "+A" (*__ptr)    \
                : "rJ" ((long)__old), "rJ" (__new)              \
                : "memory");                                    \
        ...                                                     \
})

Linux's cmpxchg is the CAS primitive; RISC-V implements it with an LR/SC equivalent.

 

static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
{
        u32 old, new, val = atomic_read(&lock->val);

        for (;;) {
                new = (val & _Q_LOCKED_PENDING_MASK) | tail;

                old = atomic_cmpxchg_relaxed(&lock->val, val, new);
                if (old == val)
                        break;

                val = old;
        }
        return old;
}

Even under contention, this cmpxchg loop must eventually break out: every cmpxchg failure should imply that some other hart's update succeeded.

 

Proposal for discussion:

From the current RISC-V spec, "A" Standard Extension for Atomic Instructions - Load-Reserved/Store-Conditional Instructions:

"The failure code with value 1 is reserved to encode an unspecified failure. Other failure codes are reserved at this time, and portable software should only assume the failure code will be non-zero. We reserve a failure code of 1 to mean “unspecified” so that simple implementations may return this value using the existing mux required for the SLT/SLTU instructions. More specific failure codes might be defined in future versions or extensions to the ISA."

 

Add a failure code of 2 to the store-conditional instruction, so that store-conditional returns a local failure code:

0 - success

1 - failure for an unspecified reason (including cache-coherency reasons)

2 - failure due to a local event (interrupt/trap)

 

Then, define three forward progress guarantee levels:

Level 1 (Weak) - SC returns 0/1/2

Level 2 (Strong) - SC returns 0/2, never returns 1.

Level 3 (Absolute) - SC returns 0, never returns 1/2.
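Under the proposed encoding, software could retry selectively on the failure code. A hypothetical retry loop (sketch only; failure code 2 and the branch on it do not exist in the ratified ISA):

    loop:
        lr.w  t0, (a0)
        sc.w  t1, t2, (a0)
        beqz  t1, done        # 0: success
        addi  t3, t1, -2
        beqz  t3, loop        # 2: failed on a local event (interrupt/trap); retry
        j     fallback        # 1: unspecified failure; take a fallback path
    done:

At Level 2 the fallback path is unreachable, so the number of retries is bounded by the rate of local events; at Level 3 the loop succeeds on the first iteration.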

 

Litmus tests for guarantee levels 1~3 and the cmpxchg forward progress guarantee.

Level 3 - Absolute:

RISCV LEVEL3-LR-SC
{
 0:x5=x;
 1:x5=x;
}
 P0               | P1          ;
 ori x6,x0,1      | ori x6,x0,2 ;
 lr.w x7,0(x5)    | sw x6,0(x5) ;
 sc.w x8,x6,0(x5) |             ;
forall
(x=1 /\ 0:x7=2 /\ 0:x8=0) \/
(x=2 /\ 0:x7=0 /\ 0:x8=0)

This makes LR/SC behave like a single atomic instruction.

 

Level 2 - Strong:

RISCV LEVEL2-LR-SC
{
 0:x5=x;
 1:x5=x;
}
 P0               | P1          ;
 ori x6,x0,1      | ori x6,x0,2 ;
 lr.w x7,0(x5)    | sw x6,0(x5) ;
 sc.w x8,x6,0(x5) |             ;
forall
(x=1 /\ 0:x7=2 /\ 0:x8=0) \/
(x=2 /\ 0:x7=2 /\ 0:x8=2) \/
(x=2 /\ 0:x7=0 /\ 0:x8=0) \/
(x=2 /\ 0:x7=0 /\ 0:x8=2)

Only a local event can make the store-conditional fail.

 

Level 1 - Weak:

RISCV LEVEL1-LR-SC
{
 0:x5=x;
 1:x5=x;
}
 P0               | P1          ;
 ori x6,x0,1      | ori x6,x0,2 ;
 lr.w x7,0(x5)    | sw x6,0(x5) ;
 sc.w x8,x6,0(x5) |             ;
forall
(x=1 /\ 0:x7=2 /\ 0:x8=0) \/
(x=2 /\ 0:x7=0 /\ 0:x8=1) \/
(x=2 /\ 0:x7=0 /\ 0:x8=2) \/
(x=2 /\ 0:x7=2 /\ 0:x8=2) \/
(x=2 /\ 0:x7=0 /\ 0:x8=0)

No forward progress guarantee.

 

cmpxchg forward progress guarantee:

RISCV LEVEL3-LR-LR-SC (Absolute Level)
{
 0:x5=x;
 1:x5=x;
}
 P0               | P1          ;
 ori x6,x0,1      | ori x6,x0,2 ;
 lr.w x7,0(x5)    | sw x6,0(x5) ;
 lr.w x8,0(x5)    |             ;
 sc.w x9,x6,0(x5) |             ;
forall
(x=1 /\ 0:x7=2 /\ 0:x8=2 /\ 0:x9=0) \/
(x=2 /\ 0:x7=0 /\ 0:x8=0 /\ 0:x9=0)

 

RISCV LEVEL2-LR-LR-SC (Strong Level)
{
 0:x5=x;
 1:x5=x;
}
 P0               | P1          ;
 ori x6,x0,1      | ori x6,x0,2 ;
 lr.w x7,0(x5)    | sw x6,0(x5) ;
 lr.w x8,0(x5)    |             ;
 sc.w x9,x6,0(x5) |             ;
forall
(x=1 /\ 0:x7=2 /\ 0:x8=2 /\ 0:x9=0) \/
(x=1 /\ 0:x7=0 /\ 0:x8=2 /\ 0:x9=0) \/
(x=2 /\ 0:x7=0 /\ 0:x8=2 /\ 0:x9=2) \/
(x=2 /\ 0:x7=2 /\ 0:x8=2 /\ 0:x9=2) \/
(x=2 /\ 0:x7=0 /\ 0:x8=0 /\ 0:x9=2) \/
(x=2 /\ 0:x7=0 /\ 0:x8=0 /\ 0:x9=0)

 

Here is the presentation for details:

https://docs.google.com/presentation/d/1UudBcj4cL_cjJexMpZNF9ppRzYxeYqsdBotIvU7sO2Q/edit?usp=sharing


Guo Ren
 

Thanks for reminding me; I will resend it with a proper title.

On Mon, Dec 12, 2022 at 7:14 PM Andrew Waterman <andrew@...> wrote:

This proposal is not appropriate for the fast track, on the basis that it’s far from uncontroversial. It should be discussed at length first.


--
Best Regards
Guo Ren