[EXTERNAL]Re: [RISC-V] [tech-unprivileged] PAUSE for LR/SC


Sanjay Patel
 

Hi Greg,

 

For now, here is a summary of my understanding of how PAUSE.LR compares with MONITOR/MWAIT. It is based primarily on my understanding of MIPS PAUSE, with only a superficial understanding of x86 MONITOR/MWAIT. To place this in the context of RISC-V, please substitute LR for each instance of LL.

  • PAUSE.LR is only invoked if the lock that LL reads is set i.e., some other thread has set the lock.
  • The LL thus subsumes the function of the MONITOR and sets the “monitor” address as that of the LL address.
  • PAUSE.LR stalls forward progress on the thread until the locally maintained lock bit is cleared; this is the event that MWAIT would otherwise wait on.

 

In the context of an LL/SC sequence, MONITOR is thus not needed, and PAUSE.LR substitutes for MWAIT. This is a simple and reliable way to reduce thread activity while obtaining a lock contended by multiple threads. It has worked for literally decades in the MIPS code base. In my mind, from an implementation and architecture perspective, simpler has always been better. 😊

 

If MONITOR/MWAIT can be used in other contexts that do not depend on LL, then MONITOR/MWAIT would be more useful for the RISC-V ISA. The only other situation I can think of is uncached addresses, and LL/SC has been extended to uncached addresses as well.

 

Since MONITOR/MWAIT has not been defined for RISC-V yet, it would make sense for us to go ahead with the custom PAUSE.LR and migrate to MONITOR/MWAIT later, once it is adopted.

 

I’ll respond with more detail on the instruction operation in my response to Allen Baum’s email.

 

Regards,

Sanjay

 

 

From: Greg Favor <gfavor@...>
Date: Tuesday, January 5, 2021 at 3:25 PM
To: "tech-unprivileged@..." <tech-unprivileged@...>, Sanjay Patel <spatel@...>
Subject: [EXTERNAL]Re: [RISC-V] [tech-unprivileged] PAUSE for LR/SC

 

Sanjay,

 

This instruction sounds like a more specialized form of the x86 MONITOR/MWAIT instructions?  From a RISC-V architecture standardization perspective, good candidates for standardization have broader (although not necessarily universal) applicability.  In that sense, wouldn't something akin to MONITOR/MWAIT be a suitable RV extension (that would also cover the specific use case you have)?  Or in what way are these two different animals?

 

Greg

 

 

On Tue, Jan 5, 2021 at 3:04 PM Sanjay Patel <spatel@...> wrote:

Hi Folks,

We have defined a custom instruction equivalent to MIPS PAUSE which deschedules the instruction stream when an LL(==RISC-V LR) fails to acquire the lock. If a snoop is detected against the LR address then execution continues beyond the PAUSE and an attempt is made to acquire the lock again. 

This instruction will be named PAUSE.LR and will be a custom instruction in our case unless we can leverage the definition of RISC-V PAUSE. The RISC-V PAUSE as currently defined (and under review) suggests that it be used for non-memory events. There is a subsequent suggestion that an rs1 field be used, so perhaps rs1=R0 could be used to specify a PAUSE.LR-like operation.

There isn't much activity on this
https://lists.riscv.org/g/tech-unprivileged/message/3?p=,,,20,0,0,0::relevance,,PAUSE,20,2,0,76890707
nor the 45-day review
https://lists.riscv.org/g/tech-compliance/message/287?p=,,,20,0,0,0::relevance,,PAUSE,20,2,0,78569228

so I've created a new topic.

Your feedback is welcome with the hope that we can avoid a custom PAUSE.LR.

Sanjay


Greg Favor
 

In line with Allen's comments, this "PAUSE.LR" is actually not so much like MONITOR/MWAIT.  It seems more like ARMv8's WFE functionality (which also includes the idea of events, local and global monitors, and the SEV instruction).  At the surface it seems like WFE would often be used in a similar manner and situation as PAUSE.LR.  But (caveat) I haven't cross-checked details.

Greg




Allen Baum
 

I guess I am even more confused now than I was before.
  • PAUSE.LR is only invoked if the lock that LL reads is set i.e., some other thread has set the lock.
I'm not sure what you mean when you say PAUSE.LR is "invoked".
I am guessing you mean that, if executed, it will pause if a previous LR has reserved some address and then had the reservation cancelled by someone else?

The semantics here don't quite match what you're assuming. There is no lock, as such.
If there were, then only the lock owner could unlock it -- but a reservation can be cancelled by anyone, simply by writing to any address in the reservation set.
And LR always succeeds at making a valid (==aligned) reservation (though it can be snatched away immediately).
So, there is no notification at the time of the LR that there was a reservation by anyone else.
In fact, multiple harts could have simultaneous reservations - and it isn't until the first one executes an SC and succeeds that the hart can conclude that it actually "owned" the "lock".
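
The reservation behaviour described above can be sketched as a toy Python model. This is only an illustration under simplifying assumptions: the class and method names are hypothetical, a single memory word stands in for the whole reservation set, and real implementations may lose reservations for other reasons as well.

```python
class ReservationModel:
    """Toy model of LR/SC reservations on one memory word (hypothetical names)."""

    def __init__(self, value=0):
        self.value = value
        self.reserved = set()  # harts that currently hold a reservation

    def lr(self, hart):
        # LR always registers a reservation and returns the current value;
        # it gives no hint that other harts also hold reservations.
        self.reserved.add(hart)
        return self.value

    def store(self, value):
        # An ordinary store from any hart cancels every reservation.
        self.value = value
        self.reserved.clear()

    def sc(self, hart, value):
        # SC succeeds only if this hart's reservation is still valid.
        # A successful SC is itself a write, so it cancels the others.
        # (The model returns True on success; the RISC-V SC instruction
        # reports success by writing 0 to its destination register.)
        if hart not in self.reserved:
            return False
        self.value = value
        self.reserved.clear()
        return True


mem = ReservationModel()
mem.lr(0)
mem.lr(1)                       # both harts hold reservations at once
both_reserved = mem.reserved == {0, 1}
first = mem.sc(1, 1)            # hart 1 gets its SC in first and succeeds
second = mem.sc(0, 1)           # hart 0's reservation was cancelled
```

Only the outcome of the SC tells a hart whether it "owned" the location; nothing observable at LR time does.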

I may be misinterpreting your intent here (I'm fairly certain I must be).
Could you write down a snippet of code (with comments) to illustrate what you mean?





Krste Asanovic
 

MIPS uses LL/SC to acquire locks, versus RISC-V which has AMOs.

To help answer Allen's question, from MIPS manual entry on PAUSE:

acquire_lock:
    ll t0, 0(a0)                  /* Read software lock, set hardware lock */
    bnez t0, acquire_lock_retry   /* Branch if software lock is taken */
    addiu t0, t0, 1               /* Set the software lock */
    sc t0, 0(a0)                  /* Try to store the software lock */
    bnez t0, 10f                  /* Branch if lock acquired successfully */
                                  /* (must have successfully set lock in memory) */
    sync                          /* [ fence in delay slot ] */
acquire_lock_retry:
    pause                         /* Wait for LLBIT to clear before retry */
    b acquire_lock                /* and retry the operation */
    nop
10:
    /* Critical region code */
release_lock:
    sync
    sw zero, 0(a0)                /* Release software lock, clearing LLBIT */
                                  /* for any PAUSEd waiters */

There are two levels of atomicity under consideration in the MIPS
example. The higher-level software lock is represented by the value of
the word in memory. The lower-level hardware lock is inherent in the
LL/SC mechanism used to attempt to update the memory word value
atomically. The code example avoids attempting the SC when the software
lock is taken. The MIPS PAUSE definition is tied to the implicit LL-bit
state.

RISC-V has standard AMOs to acquire this kind of simple lock without
needing an LR/SC sequence (and separate fences). A
test-and-test-and-set loop is used to avoid updating the lock when it
is seen to be held, with the RISC-V PAUSE used in the read-only test
loop. Editing the sample in Figure 8.2 of the RISC-V spec to add
PAUSE:

again:
    pause
acquire_lock:                     /* Entry point */
    lw t1, (a0)
    bnez t1, again
    li t0, 1
    amoswap.w.aq t1, t0, (a0)
    bnez t1, again
    /* Critical section */
    amoswap.w.rl x0, x0, (a0)     /* Release lock */

To mimic the "wait on event" behavior, a RISC-V implementation could
implement some microarchitectural heuristics to modulate PAUSE
duration based on activity to the locked memory location (e.g., wait
1000 cycles unless there is a memory access/cache probe to the last
read memory address). However, this does not need to be spelled out
in the RISC-V PAUSE ISA spec. The only way this constrains
implementations is that they have to guarantee finite PAUSE duration
regardless of microarchitectural activity. The MIPS PAUSE definition
allows PAUSE to hang forever if there is no activity to clear the lock
bit.
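
As a rough sketch of the heuristic described above, the following Python function models how long a PAUSE might stall. The function name, its parameters, and the cycle-count framing are assumptions made for illustration; the PAUSE spec does not define any of them.

```python
def pause_cycles(timeout, probe_at=None):
    """Cycles a modelled PAUSE stalls for (hypothetical model).

    timeout  -- implementation-chosen upper bound, e.g. 1000 cycles
    probe_at -- cycle at which a probe hits the last-read address,
                or None if no such probe arrives
    """
    if timeout <= 0:
        raise ValueError("timeout must be positive so that PAUSE is finite")
    if probe_at is None or probe_at >= timeout:
        return timeout            # no early wake-up: the bound applies
    return max(probe_at, 0)       # activity on the address ends PAUSE early


no_activity = pause_cycles(1000)
early_wakeup = pause_cycles(1000, probe_at=40)
late_probe = pause_cycles(1000, probe_at=5000)
```

Whatever the microarchitecture observes, the result never exceeds `timeout`, which is the finite-duration property the spec would require.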

The proposed RISC-V PAUSE has a simpler definition, and also wider
application, as it can be used when spinning on a value written to
non-cached memory (e.g., a scratchpad-resident flag variable or a
volatile device register bit). The MIPS PAUSE and the proposed
PAUSE.LR assume a cached memory system or at least one supporting
probes on lock locations.

As many in this thread have noted, a more sophisticated "wait on
memory event" instruction could be a separate useful extension, but
this does not supplant simple PAUSE functionality.

Krste




Sanjay Patel
 

Thanks Krste (and Allen and Greg).

In our RISC-V core, we intend to stage the implementation of the A-extension: first LR/SC, then the AMOs. This means that for a short period only LR/SC will be available to acquire locks, as in MIPS. From the perspective of the RISC-V software stack, is there any hard dependency that prevents us from implementing the A-extension in this manner for a Linux-based 64-bit multi-core platform?

One other question. Can the RISC-V Pause be substituted for the MIPS Pause below? I think so, but I'd like to have feedback from the RISC-V experts.

Sanjay



Krste Asanovic
 

On Thu, 7 Jan 2021 05:39:00 +0000, "Sanjay Patel" <spatel@wavecomp.com> said:
| Thanks Krste (and Allen and Greg).
| In our RISC-V core, we intend to stage the implementation of the
| A-extension: first LR/SC, then the AMOs. This means that for a
| short period only LR/SC will be available to acquire locks, as in
| MIPS. From the perspective of the RISC-V software stack, is there any
| hard dependency that prevents us from implementing the A-extension
| in this manner for a Linux-based 64-bit multi-core platform?

If you compile your own Linux and apps you can make this work, but
standard Linux distributions will expect AMOs to be implemented.

| One other question. Can the RISC-V Pause be substituted for the MIPS Pause below? I think so, but I'd like to have feedback from the RISC-V experts.

Yes, and you can implement the exact same semantics as MIPS PAUSE to
end PAUSE on LL-bit activity as a microarch optimization, but must
ensure can't PAUSE indefinitely on RISC-V.

Krste




Sanjay Patel
 

Thanks Krste.

This is helpful and we can map MIPS PAUSE behavior to RISC-V PAUSE.
<snip>
Yes, and you can implement the exact same semantics as MIPS PAUSE to
end PAUSE on LL-bit activity as a microarch optimization, but must
ensure can't PAUSE indefinitely on RISC-V.
<snip>

Your sentence also implies we need to implement a counter or similar mechanism to ensure an exit, if not on the basis of a snoop event.
This isn't clear. If the lock is never cleared for some reason, the counter expiring would cause the code to loop again to check the reservation, but it would not gracefully terminate.
So basically I'd like to understand how the stated requirement helps.

Sanjay



Krste Asanovic
 

On Thu, 7 Jan 2021 16:52:35 +0000, "Sanjay Patel" <spatel@wavecomp.com> said:
| Thanks Krste.
| This is helpful and we can map MIPS PAUSE behavior to RISC-V PAUSE.
| <snip>
| Yes, and you can implement the exact same semantics as MIPS PAUSE to
| end PAUSE on LL-bit activity as a microarch optimization, but must
| ensure can't PAUSE indefinitely on RISC-V.
| <snip>

| Your sentence also implies we need to implement a counter or similar mechanism to ensure an exit, if not on the basis of a snoop event.
| This isn't clear. If the lock is never cleared for some reason, the counter expiring would cause the code to loop again to check the reservation, but it would not gracefully terminate.
| So basically I'd like to understand how the stated requirement helps.

If the RISC-V PAUSE could hang indefinitely on any use, it would be
impossible to prove progress of some algorithms, unless there is some
environmental mechanism guaranteed to cause PAUSE to resume. These
environmental mechanisms would be hard to define in a common way
across all the systems RISC-V is supporting.

In your particular scenario, where PAUSE has semantics of exiting on
access to a specified memory location or set of locations, and the
lock protocol is known, you can build confidence you don't need a
timeout to ensure progress.

However, if the PAUSE is in a loop waiting for some condition where it
is not easily determined by the core the condition has been satisfied
(e.g., a bit in a device register), then a potentially indefinite
PAUSE is unusable.
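
The device-register case can be illustrated with a small Python model of a polling loop. It is a sketch under assumed names and units (`cycles_until_seen`, cycle counts, a `max_cycles` cut-off), not a description of any real core.

```python
def cycles_until_seen(ready_at, pause_timeout, max_cycles=1_000_000):
    """Cycle at which a polling loop observes a device bit that becomes
    set at cycle ready_at, or None if the loop never observes it.

    pause_timeout=None models a PAUSE that ends only on an event visible
    to the core; the device bit produces no such event.
    """
    if pause_timeout is not None and pause_timeout <= 0:
        raise ValueError("pause_timeout must be positive")
    now = 0
    while now <= max_cycles:
        if now >= ready_at:        # re-read the device register
            return now
        if pause_timeout is None:  # no wake-up event: stalls forever
            return None
        now += pause_timeout       # bounded PAUSE, then poll again
    return None


bounded = cycles_until_seen(ready_at=2500, pause_timeout=1000)
unbounded = cycles_until_seen(ready_at=2500, pause_timeout=None)
```

With a bounded PAUSE the loop notices the bit at the first poll after it is set; with an event-only PAUSE it makes no progress unless the bit was already set on the first read.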

I'm not opposed to adding specific "WAIT-FOR-X" instructions in
addition to PAUSE, but they can't replace the use of simple PAUSE when
waiting for events whose trigger is not available to the core. The
WAIT-FOR instructions will also generally have more complex and
system-specific semantics.

Krste



Sanjay Patel
 

Hi Krste,

 

One last attempt to unify the definitions. 😊 This is to avoid creating a custom WAIT-FOR-X equivalent to MIPS PAUSE.

 

I’ve reformatted the MIPS code you had included such that the operation with LL(LR)/SC is clear in the context of PAUSE use.

 

acquire_lock:
         ll t0, 0(a0)
         bnez t0, acquire_lock_retry
         addiu t0, t0, 1
         sc t0, 0(a0)
         bnez t0, 10f
         sync
acquire_lock_retry:
         pause
         b acquire_lock
10:
         Critical region code
release_lock:
         sync
         sw zero, 0(a0)

 

A MIPS implementation will set a hardware “LLbit” on execution of LL. We can implement a bi-modal RISC-V PAUSE that waits on the snoop if LLbit is set. If LLbit is not set, then PAUSE waits on the timer expiry. I do think this is a viable solution as it provides the same guarantee of exit as the timer expiry.

 

On the other hand, if you think it is architecturally messy to implement the bi-modal behavior, then we can implement a custom WAIT-FOR-X for LR/SC, and transition this to a RISC-V WAIT-FOR-X when included in the RISC-V architecture.

 

Sanjay

 

 

On 1/8/21, 6:42 PM, "krste@..." <krste@...> wrote:

 

 

    >>>>> On Thu, 7 Jan 2021 16:52:35 +0000, "Sanjay Patel" <spatel@...> said:

 

    | Thanks Krste.

    | This is helpful and we can map MIPS PAUSE behavior to RISC-V PAUSE.

    | <snip>

    | Yes, and you can implement the exact same semantics as MIPS PAUSE to

    |     end PAUSE on LL-bit activity as a microarch optimization, but must

    |     ensure can't PAUSE indefinitely on RISC-V.

    | <snip>

 

    | Your sentence also implies we need to implement a counter of similar to ensure an exit, if not on the basis of a snoop event.

    | This isn't clear. If the lock is never cleared for some reason, the counter expiring would cause the code to loop again to check the reservation, but it would not gracefully terminate.

    | So basically I'd like to understand how the stated requirement helps.

 

    If the RISC-V PAUSE could hang indefinitely on any use, it would be impossible to prove progress of some algorithms, unless there is some environmental mechanism guaranteed to cause PAUSE to resume. These environmental mechanisms would be hard to define in a common way across all the systems RISC-V is supporting.

 

    In your particular scenario, where PAUSE has semantics of exiting on access to a specified memory location or set of locations, and the lock protocol is known, you can build confidence you don't need a timeout to ensure progress.

 

    However, if the PAUSE is in a loop waiting for some condition where it is not easily determined by the core the condition has been satisfied (e.g., a bit in a device register), then a potentially indefinite PAUSE is unusable.
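
To make the device-register case concrete, here is a minimal C sketch of such a polling loop. The names `cpu_pause`, `wait_for_bit`, and `max_polls` are hypothetical and chosen only for this illustration; `cpu_pause` is a portable stand-in for a PAUSE hint and is assumed to return after a bounded delay.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a PAUSE hint. Assumed to return after a
 * bounded delay; here it is a no-op so the sketch runs anywhere. */
static inline void cpu_pause(void) { }

/* Poll until (*reg & mask) is nonzero or max_polls reads have been made.
 * The core cannot observe the device event directly, so every return
 * from cpu_pause() is followed by another read of the register. */
static bool wait_for_bit(const volatile uint32_t *reg, uint32_t mask,
                         unsigned max_polls)
{
    assert(mask != 0);
    for (unsigned i = 0; i < max_polls; i++) {
        if (*reg & mask)
            return true;   /* condition satisfied */
        cpu_pause();       /* must be bounded, or the loop never re-reads */
    }
    return false;          /* gave up after max_polls reads */
}
```

If `cpu_pause` could stall indefinitely, the loop would never reach its next read of the register, which is the situation described above.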

 

    I'm not opposed to adding specific "WAIT-FOR-X" instructions in addition to PAUSE, but they can't replace the use of simple PAUSE when waiting for events whose trigger is not available to the core. The WAIT-FOR instructions will also generally have more complex and system-specific semantics.

 

    Krste

 

    | Sanjay

 


Krste Asanovic
 

As far as I can tell, the only advantage of the MIPS-style PAUSE over
the RISC-V proposed PAUSE would be to remove the need for a timeout
counter in the microarchitecture. But in the bimodal form you've
given, you still need the timeout counter, but have also stopped PAUSE
from being a HINT (and added definition of LL-bit to LR/SC).

I don't believe the reduction in polling overhead is a significant
factor, as an implementation can set the default PAUSE duration such
that this is sufficiently low.

What is the objection to the PAUSE occasionally causing a retry and a
re-PAUSE when timer expires and LL-bit is set? This will happen
anyway if there are interrupts.
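
As a back-of-the-envelope illustration of the polling overhead, and only as a simplified sketch: if a lock stays held for `hold_cycles` and each PAUSE lasts `pause_cycles`, a waiting hart wakes and re-checks roughly `ceil(hold_cycles / pause_cycles)` times. The function name `pause_wakeups` and both parameter names are hypothetical, and the model ignores interrupts and the cost of the re-check itself.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model: number of times a waiting hart leaves PAUSE and
 * re-checks the lock while it stays held for hold_cycles, when every
 * PAUSE lasts exactly pause_cycles. Ignores interrupts and loop overhead. */
static uint64_t pause_wakeups(uint64_t hold_cycles, uint64_t pause_cycles)
{
    assert(pause_cycles > 0);
    /* Ceiling division written so that it cannot overflow. */
    return hold_cycles / pause_cycles + (hold_cycles % pause_cycles != 0);
}
```

Under these assumptions, a 1000-cycle hold with a 64-cycle PAUSE gives 16 re-checks, so the default PAUSE duration directly sets how often the retry path runs.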

Krste





Sanjay Patel
 

I am partial to MIPS PAUSE for LR/SC because it is optimized for the specific event that should clear the reservation: the snoop. Further, the wait does not have to be "sized" as the RISC-V PAUSE timer does, since it follows from the number of threads contending for the lock; i.e., the MIPS PAUSE handling just works out of the box. I do agree, though, that it is specific to LR/SC and not applicable to the AMOs. That is why I wanted it to be bi-modal, to handle both cases.

Is it recommended that the RISC-V timer be programmable? If so, would it be better as an M-mode register? I say M-mode because, if the count can be programmed arbitrarily by any thread, there is a risk of livelock, since a thread with a short cycle count will poll the lock more frequently. Although the timer is not specified directly in the specification, I think it should be, as a possible or recommended implementation.

Also, how does a software programmer assess the value of the counter? I would assume it is the latency from the core to the point of shared coherency.

Other rules may have to be specified for PAUSE. Does an interrupt cause the timer to be cleared? I think it should.

(Btw, the concerns I raise could just be a matter of good engineering.)

I'm not clear how the definition I've provided prevents RISC-V PAUSE from being used as a HINT. HINTs should not modify architectural state, and implementations can choose to NOP the instruction. I don't think I've violated either rule.

If you can provide some clarity regarding my questions about RISC-V PAUSE then we might be able to make it work as is. I think it is the ambiguity that drives me towards the familiar MIPS PAUSE.

Sanjay



Krste Asanovic
 

On Fri, 15 Jan 2021 17:14:23 +0000, Sanjay Patel <spatel@wavecomp.com> said:
| I am partial to MIPS PAUSE for LR/SC because it is optimized for the specific event that should clear the reservation: the snoop. Further, the wait does not have to be "sized" as the RISC-V PAUSE timer does, since it follows from the number of threads contending for the lock; i.e., the MIPS PAUSE handling just works out of the box. I do agree, though, that it is specific to LR/SC and not applicable to the AMOs. That is why I wanted it to be bi-modal, to handle both cases.

You can implement the LL-bit as a purely microarchitectural
optimization for your use case. For example, you can use two
different timeouts for the case where LL-bit is set or not. I believe
this captures almost all of the benefits of the MIPS approach while
avoiding adding complexity to the RISC-V ISA spec.

| Is it recommended that the RISC-V timer be programmable? If so, would it be better as an M-mode register? I say M-mode because, if the count can be programmed arbitrarily by any thread, there is a risk of livelock, since a thread with a short cycle count will poll the lock more frequently. Although the timer is not specified directly in the specification, I think it should be, as a possible or recommended implementation.

| Also, how does a software programmer assess the value of the counter? I would assume it is the latency from the core to the point of shared coherency.
| Other rules may have to be specified for PAUSE. Does an interrupt cause the timer to be cleared? I think it should.
| (Btw, the concerns I raise could just be a matter of good engineering.)

Software locking code can add an adaptive backoff loop if required,
with no need to involve either more privileged layers or additional
hardware (neither of which will understand what application software
is actually trying to do).
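
One common form of software backoff is exponential: each failed attempt doubles the number of pause hints issued before the next attempt, up to a cap. The following C11 sketch illustrates the idea under stated assumptions. All names (`backoff_lock`, `next_backoff`, `cpu_pause`, `BACKOFF_MIN`, `BACKOFF_MAX`) are hypothetical, the cap of 1024 is arbitrary, and `cpu_pause` is a no-op stand-in for a PAUSE hint.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define BACKOFF_MIN 1u      /* arbitrary starting delay, in pause hints */
#define BACKOFF_MAX 1024u   /* arbitrary cap on the delay */

/* Hypothetical stand-in for a PAUSE hint (no-op so the sketch is portable). */
static inline void cpu_pause(void) { }

typedef struct { atomic_uint locked; } backoff_lock;   /* 0 = free, 1 = held */

/* Double the delay, saturating at BACKOFF_MAX. */
static unsigned next_backoff(unsigned current)
{
    assert(current >= BACKOFF_MIN && current <= BACKOFF_MAX);
    return current >= BACKOFF_MAX / 2 ? BACKOFF_MAX : current * 2;
}

/* Single attempt: swap in 1 and succeed if the old value was 0. */
static bool backoff_try_lock(backoff_lock *l)
{
    return atomic_exchange_explicit(&l->locked, 1u,
                                    memory_order_acquire) == 0u;
}

/* Spin until acquired, pausing longer after each failed attempt. */
static void backoff_lock_acquire(backoff_lock *l)
{
    unsigned delay = BACKOFF_MIN;
    while (!backoff_try_lock(l)) {
        for (unsigned i = 0; i < delay; i++)
            cpu_pause();
        delay = next_backoff(delay);
    }
}

static void backoff_lock_release(backoff_lock *l)
{
    atomic_store_explicit(&l->locked, 0u, memory_order_release);
}
```

The policy lives entirely in the locking code, so it can be tuned per lock without any privileged register or extra hardware state.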

| I'm not clear how the definition I've provided prevents RISC-V PAUSE from being used as a HINT. HINTs should not modify architectural state, and implementations can choose to NOP the instruction. I don't think I've violated either rule.

The MIPS PAUSE mandates the existence of an LL-bit, and has behavior
that depends on this LL-bit. I realize that the MIPS spec does not
guarantee that threads will stop at a PAUSE with LL-bit set, but this
is allowed by spec and software could rely on it, or more importantly,
software cannot rely on PAUSE being exited unless LL-bit is cleared.
This is why it is not really a HINT, even though a particular hardware
implementation could implement it as a NOP.

| If you can provide some clarity regarding my questions about RISC-V PAUSE then we might be able to make it work as is. I think it is the ambiguity that drives me towards the familiar MIPS PAUSE.

Hope this helps,
Krste




Sanjay Patel
 

Thanks Krste.

With your feedback, I'll be able to provide a definition/implementation of RISC-V PAUSE that works like MIPS PAUSE for LR/SC, but otherwise conforms to the current definition of RISC-V PAUSE for all other cases such as AMOs, or simply a standalone delay. In all cases it will be architecturally compliant by providing a bounded exit for PAUSE.

I also realized that the delay can be fixed: it does not have to factor in latency to shared memory or be tuned to the critical-section code itself. It should be seen as a pipeline characteristic whose goal is to reduce activity while a hart spins on the lock, so it does not have to be programmable.

I'll provide a summary of the handling later, in case detail should be added to the description.

Sanjay

PS: Thanks for Allen Baum's background help.



Phil McCoy <pnm@...>
 

If the MIPS-like heuristic of exiting the PAUSE early when the LR reservation gets cleared becomes popular, it would be useful to modify the standard RISC-V AMO code idiom to inform the implementation that the software is waiting for the lock variable to change:

again:
    pause
acquire_lock:                      # Entry point
    lr.w t1, (a0)                  # lr.w instead of lw, as a hint to enable optimized PAUSE handling for the lock variable
    bnez t1, again                 # Lock held: pause, then retry
    li t0, 1                       # Swap value
    amoswap.w.aq t1, t0, (a0)      # Attempt to acquire lock
    bnez t1, again                 # Lost the race: pause, then retry
    # Critical section
    amoswap.w.rl x0, x0, (a0)      # Release lock

This loop would spin very slowly until the lock becomes available, but still react quickly when the lock is released by another hart.

Just wondering if this optimization seems interesting to anyone outside of MIPS. If there is interest, maybe a non-normative comment recommending this heuristic would be in order.

One other minor comment - can't the lock be released by a simple store instead of an amoswap to x0?
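
For readers more at home in C, the loop above and the store-based release can be sketched with C11 atomics. This is an analogue under stated assumptions, not a translation: C has no way to express "lr as a hint", so a relaxed load stands in for the initial check, and the names `ttas_try_acquire`, `ttas_acquire`, `ttas_release`, and `cpu_pause` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a PAUSE hint (no-op so the sketch is portable). */
static inline void cpu_pause(void) { }

/* One pass of the loop: check first, then attempt the swap. */
static bool ttas_try_acquire(atomic_int *lock)
{
    /* Read-only check, standing in for the load before the AMO. */
    if (atomic_load_explicit(lock, memory_order_relaxed) != 0)
        return false;
    /* Counterpart of amoswap.w.aq: swap in 1, succeed if old value was 0. */
    return atomic_exchange_explicit(lock, 1, memory_order_acquire) == 0;
}

/* Pause between failed passes, as the "again:" label does. */
static void ttas_acquire(atomic_int *lock)
{
    while (!ttas_try_acquire(lock))
        cpu_pause();
}

/* Release with a store that carries release ordering, not a swap. */
static void ttas_release(atomic_int *lock)
{
    atomic_store_explicit(lock, 0, memory_order_release);
}
```

A store is enough to release the lock as long as release ordering is preserved; on RISC-V, compilers typically emit a `fence rw,w` followed by a plain `sw` for such a store-release.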

Cheers,
Phil


Sanjay Patel
 

Hello,

 

I'd like to conclude discussions around the PAUSE specification as it relates to LR/SC.

 

In our implementation we will optimize LR/SC handling based on the event of clearing the reservation, while simultaneously allowing a timer/count value to cause an exit from PAUSE on expiry.

 

I request that the PAUSE description be changed. The current description states the following:

 

“Like other FENCE instructions, PAUSE cannot be used within LR/SC sequences”

 

I would like to have that replaced by

 

Options:

  1. An implementation may choose to optimize an LR/SC lock sequence by causing the PAUSE to exit either when a bounded event (such as a timer expiry) occurs, or when an LR reservation that was active at the time the PAUSE was issued subsequently clears. In such an implementation, the PAUSE counter may use a higher time limit when an active LR reservation is present, exploiting the fact that for appropriately constructed code sequences the hart cannot acquire the required lock until its LR reservation clears.
  2. An implementation may choose to optimize an LR/SC lock sequence by causing the PAUSE to exit either when a bounded event (such as a timer expiry) occurs, or when an LR reservation that was active at the time the PAUSE was issued subsequently clears.

 

The first is more verbose and includes a description of an implementation optimization. The second is less verbose and leaves the detail to the individual implementation.

 

Or there can be a 3rd choice of whatever the workgroup considers appropriate.
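
The exit rule in the two options can be illustrated with a small C model. This is only a sketch of one possible reading: `pause_exit_cycle`, `NEVER`, and the limit parameters are hypothetical names, cycle counts are measured from the point PAUSE issues, and nothing here is proposed specification text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NEVER UINT64_MAX   /* hypothetical marker: reservation never clears */

/*
 * Toy model of the proposed exit rule.
 *   reservation_active : an LR reservation was active when PAUSE issued
 *   clear_cycle        : cycle at which that reservation clears, or NEVER
 *   short_limit        : timer limit with no active reservation
 *   long_limit         : timer limit with an active reservation (option 1)
 * Passing long_limit == short_limit gives a single-timeout implementation,
 * which the shorter option 2 wording would also permit.
 */
static uint64_t pause_exit_cycle(bool reservation_active, uint64_t clear_cycle,
                                 uint64_t short_limit, uint64_t long_limit)
{
    assert(short_limit > 0 && long_limit >= short_limit);
    if (!reservation_active)
        return short_limit;   /* timer only */
    /* Exit on whichever comes first: reservation clear or timer expiry. */
    return clear_cycle < long_limit ? clear_cycle : long_limit;
}
```

In every case the result is at most `long_limit`, so the exit from PAUSE stays bounded even if the reservation never clears.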

 

Pls let me know your feedback.

 

Sanjay

 

Btw, I’m curious about the constraint placed on the use of FENCE with LR/SC. Is it because waiting for the barrier to clear would result in relatively unbounded delay in exiting PAUSE? Further, I assume this is not a concern for PAUSE which leverages FENCE as a hint, since the successor group is 0 and thus the barrier behavior need not be implemented.

 

 

On 1/15/21, 12:16 PM, "krste@..." <krste@...> wrote:

 

 

    >>>>> On Fri, 15 Jan 2021 17:14:23 +0000, Sanjay Patel <spatel@...> said:

 

    | I am partial towards MIPS PAUSE for LR/SC because it is optimized for the specific event that should clear the reservation, the snoop. Further, the response doesn't have to be "sized" as in the case of the RISC-V PAUSE timer, as the response is in response to the number of threads contending for the lock. i.e., the MIPS PAUSE handling just works out of the box. I do agree though that it is specific to LR/SC and not applicable to the AMOs. But that's why I wanted it bi-modal, to handle the two cases.

 

    You can implement the LL-bit as a purely microarchitectural

    optimization for your use case.  For example, you can use two

    different timeouts for the case where LL-bit is set or not.  I believe

    this captures almost all of the benefits of the MIPS approach while

    avoiding adding complexity to the RISC-V ISA spec.

 

    | Is it recommended that the RISC-V timer be programmable? If so, is it better to be an M-mode register? The reason I say M-mode is that if the count can be programmed arbitrarily by any thread, then there is the risk of livelocks since a thread that has a short cycle count will poll the lock more frequently. While the timer is not specified directly in the specification, I think it should as a possible or recommended implementation.

 

    | Also, how does a software programmer assess the value of counter? I would assume it is the latency from the core to the point of shared coherency.

    | Other rules may have to be specified for PAUSE. Does an interrupt cause the timer to be cleared? I think it should.

    | (Btw, the concerns I raise could just be a matter of good engineering. )

 

    Software locking code can add adaptive backoff loop if required and

    with no need to involve either more privileged layers or additional

    hardware (neither of which will understand what application software

    is actually trying to do).

 

    | I'm not clear how the definition I've provided prevents RISC-V from being use as a HINT. HINTS should not modify architecture state, and implementations can choose to NOP the instruction. I don't think I've violated either rule.

 

    The MIPS PAUSE mandates the existence of an LL-bit, and has behavior

    that depends on this LL-bit.  I realize that the MIPS spec does not

    guarantee that threads will stop at a PAUSE with LL-bit set, but this

    is allowed by spec and software could rely on it, or more importantly,

    software cannot rely on PAUSE being exited unless LL-bit is cleared.

    This is why it is not really a HINT, even though a particular hardware

    implementation could implement it as a NOP.

 

    | If you can provide some clarity regarding my questions about RISC-V PAUSE then we might be able to make it work as is. I think it is the ambiguity that drives me towards the familiar MIPS PAUSE.

 

    Hope this helps,

    Krste

 

 

    | Sanjay

 

    | On 1/14/21, 8:31 AM, "krste@..." <krste@...> wrote:

 

 

    |     As far as I can tell, the only advantage of the MIPS-style PAUSE over

    |     the RISC-V proposed PAUSE would be to remove the need for a timeout

    |     counter in the microarchitecture.  But in the bimodal form you've

    |     given, you still need the timeout counter, but have also stopped PAUSE

    |     from being a HINT (and added definition of LL-bit to LR/SC).

 

    |     I don't believe the reduction in polling overhead is a significant

    |     factor, as an implementation can set the default PAUSE duration such

    |     that this is sufficiently low.

 

    |     What is the objection to the PAUSE occasionally causing a retry and a

    |     re-PAUSE when timer expires and LL-bit is set?  This will happen

    |     anyway if there are interrupts.

 

    |     Krste

 

 

 

    |||||| On Mon, 11 Jan 2021 18:01:38 +0000, Sanjay Patel <spatel@...> said:

 

    |     | Hi Krste,

    |     | One last attempt to unify the definitions. 😊 This is to avoid creating a custom WAIT-FOR-X, equivalent to MIPS PAUSE.

 

    |     | I’ve reformatted the MIPS code you had included such that the operation with LL(LR)/SC is clear in the context of PAUSE use.

 

    |     | acquire_lock:

 

    |     |          ll t0, 0(a0)    

 

    |     |          bnez t0, acquire_lock_retry

 

    |     |          addiu t0, t0, 1

 

    |     |          sc t0, 0(a0)

 

    |     |          bnez t0, 10f

 

    |     |          sync

 

    |     | acquire_lock_retry:

 

    |     |          pause

 

    |     |          b acquire_lock

 

    |     | 10:

 

    |     |          Critical region code

 

    |     | release_lock:

 

    |     |          sync

 

    |     |          sw zero, 0(a0)

 

    |     | A MIPS implementation will set a hardware “LLbit” on execution of LL. We can implement a bi-modal RISC-V PAUSE that waits on the snoop if LLbit is

    |     | set. If LLbit is not set, then PAUSE waits on the timer expiry. I do think this is a viable solution as it provides the same guarantee of exit as

    |     | the timer expiry.

 

    |     | On the other hand, if you think it is architecturally messy to implement the bi-modal behavior, then we can implement a custom WAIT-FOR-X for LR/

    |     | SC, and transition this to a RISC-V WAIT-FOR-X when included in the RISC-V architecture.

 

    |     | Sanjay

 

    |     | On 1/8/21, 6:42 PM, "krste@..." <krste@...> wrote:

 

    |     |||||| On Thu, 7 Jan 2021 16:52:35 +0000, "Sanjay Patel" <spatel@...> said:

 

    |     |     | Thanks Krste.

 

    |     |     | This is helpful and we can map MIPS PAUSE behavior to RISC-V PAUSE.

 

    |     |     | <snip>

 

    |     |     | Yes, and you can implement the exact same semantics as MIPS PAUSE to

 

    |     |     |     end PAUSE on LL-bit activity as a microarch optimization, but must

 

    |     |     |     ensure can't PAUSE indefinitely on RISC-V.

 

    |     |     | <snip>

 

    |     |     | Your sentence also implies we need to implement a counter or similar to ensure an exit, if not on the basis of a snoop event.

 

    |     |     | This isn't clear. If the lock is never cleared for some reason, the counter expiring would cause the code to loop again to check the

    |     | reservation, but it would not gracefully terminate.

 

    |     |     | So basically I'd like to understand how the stated requirement helps.

 

    |     |     If the RISC-V PAUSE could hang indefinitely on any use, it would be

 

    |     |     impossible to prove progress of some algorithms, unless there is some

 

    |     |     environmental mechanism guaranteed to cause PAUSE to resume.  These

 

    |     |     environmental mechanisms would be hard to define in a common way

 

    |     |     across all the systems RISC-V is supporting.

 

    |     |     In your particular scenario, where PAUSE has semantics of exiting on

 

    |     |     access to a specified memory location or set of locations, and the

 

    |     |     lock protocol is known, you can build confidence you don't need a

 

    |     |     timeout to ensure progress.

 

    |     |     However, if the PAUSE is in a loop waiting for some condition where it

 

    |     |     is not easily determined by the core the condition has been satisfied

 

    |     |     (e.g., a bit in a device register), then a potentially indefinite

 

    |     |     PAUSE is unusable.

 

    |     |     I'm not opposed to adding specific "WAIT-FOR-X" instructions in

 

    |     |     addition to PAUSE, but they can't replace the use of simple PAUSE when

 

    |     |     waiting for events whose trigger is not available to the core.  The

 

    |     |     WAIT-FOR instructions will also generally have more complex and

 

    |     |     system-specific semantics.

 

    |     |     Krste

 

    |     |     | Sanjay
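
For reference, Krste's remark in the quoted exchange above, that software locking code can add its own adaptive backoff loop, might look roughly like the following C11 sketch. This is only an illustration under stated assumptions: the names `backoff_next`, `try_lock_with_backoff`, `BACKOFF_MIN`, and `BACKOFF_MAX` are hypothetical, the delay values are arbitrary, and `cpu_relax` is a portable placeholder standing where a PAUSE would go.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define BACKOFF_MIN 1u     /* hypothetical starting delay, in relax iterations */
#define BACKOFF_MAX 256u   /* hypothetical cap so the delay stays bounded */

/* Placeholder for a PAUSE-like hint; a compiler-only fence to stay portable. */
static inline void cpu_relax(void)
{
    atomic_signal_fence(memory_order_seq_cst);
}

/* Double the delay after each failed attempt, saturating at BACKOFF_MAX. */
static unsigned backoff_next(unsigned delay)
{
    return (delay >= BACKOFF_MAX / 2) ? BACKOFF_MAX : delay * 2;
}

/* Try to take the lock up to max_attempts times; returns true on success. */
static bool try_lock_with_backoff(atomic_int *lock, unsigned max_attempts)
{
    unsigned delay = BACKOFF_MIN;
    for (unsigned attempt = 0; attempt < max_attempts; attempt++) {
        /* Test first, then test-and-set with acquire ordering. */
        if (atomic_load_explicit(lock, memory_order_relaxed) == 0 &&
            atomic_exchange_explicit(lock, 1, memory_order_acquire) == 0)
            return true;
        /* Lock was busy: wait for the current delay, then lengthen it. */
        for (unsigned i = 0; i < delay; i++)
            cpu_relax();
        delay = backoff_next(delay);
    }
    return false;
}
```

The backoff policy lives entirely in application code here, which is the point being made: no privileged layer or extra hardware state is involved, and each individual wait is bounded.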

 

 


Allen Baum
 

Could you re-send the code with the LR/SC sequence you want to implement again?
Phil's example above has an LR, but without an SC, so technically not in an LR/SC sequence.

If it is inside an LR/SC pair, then it falls into the "unconstrained" LR/SC sequence. 
This removes some forward progress guarantees, but your use case might make that moot.
It sounds like loosening up the language to not prohibit it, but just warn of the consequences, may be in order.





On Fri, Feb 5, 2021 at 12:18 PM Sanjay Patel <spatel@...> wrote:

Hello,

 

I'd like to conclude discussions around the PAUSE specification as it relates to LR/SC.

 

In our implementation we will be optimizing LR/SC handling based on the event of clearing the reservation while simultaneously allowing a timer/count value to cause exit from PAUSE on expiry.

 

I request that the PAUSE description be changed. The current description states the following:

 

“Like other FENCE instructions, PAUSE cannot be used within LR/SC sequences”

 

I would like to have that replaced by

 

Options:

  1. An implementation may choose to optimize an LR/SC lock sequence by causing the PAUSE to exit either when a bounded event (such as a timer expiry) occurs, or when an LR reservation that was active at the time the pause was issued subsequently clears. In such an implementation, the pause counter may use a higher time limit when an active LR reservation is present, exploiting the fact that for appropriately constructed code sequences the hart cannot acquire the required lock until its LR reservation clears.
  2. An implementation may choose to optimize an LR/SC lock sequence by causing the PAUSE to exit either when a bounded event (such as a timer expiry) occurs, or when an LR reservation that was active at the time the pause was issued subsequently clears.

 

The first is more verbose and includes a description of an implementation optimization. The second is less verbose and leaves the detail to the individual implementation.

 

Or there can be a 3rd choice of whatever the workgroup considers appropriate.

 

Pls let me know your feedback.

 

Sanjay

 

Btw, I’m curious about the constraint placed on the use of FENCE with LR/SC. Is it because waiting for the barrier to clear would result in relatively unbounded delay in exiting PAUSE? Further, I assume this is not a concern for PAUSE which leverages FENCE as a hint, since the successor group is 0 and thus the barrier behavior need not be implemented.

 

 


Krste Asanovic
 

On Fri, 05 Feb 2021 10:00:50 -0800, "Phil McCoy" <pnm@computer.org> said:
| If the MIPS-like heuristic of exiting the PAUSE early when the LR reservation gets cleared becomes
| popular, it would be useful to modify the standard RISC-V AMO code idiom to inform the implementation that
| the software is waiting for the lock variable to change:

| again:
| pause
| acquire_lock: /* Entry point */
| lr.w t1, (a0) // lr.w instead of lw as hint to enable optimized PAUSE handling for the lock variable
| bnez t1, again
| li t0, 1
| amoswap.w.aq t1, t0, (a0)
| bnez t1, again
| /* Critical section */
| amoswap.w.rl x0, x0, (a0) /* Release lock */

A comment on making this the standard idiom is that more
implementations have AMOs (Zaamo) than also have LR/SC (Zalrsc).
LR/SC tends to rely on there being a coherent cache system.

| This loop would spin very slowly until the lock becomes available, but still react quickly when the lock
| is released by another hart.

| Just wondering if this optimization seems interesting to anyone outside of MIPS. If there is interest,
| maybe a non-normative comment recommending this heuristic would be in order.

| One other minor comment - can't the lock be released by a simple store instead of an amoswap to x0?

The amoswap provides release ordering as well as the store. A simple
store would need a FENCE too.

Krste

| Cheers,
| Phil
|
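
As a rough C11 analogue of the AMO-based idiom discussed above, the sketch below acquires with an atomic exchange carrying acquire ordering and releases with a store carrying release ordering. The type `amo_lock_t` and the functions `amo_try_acquire` and `amo_release` are hypothetical names chosen for illustration. How a given compiler lowers the release store on RISC-V (for example, a FENCE followed by a plain store, or an AMO with `.rl`) is an implementation matter that this thread does not specify.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type: 0 means free, 1 means held. */
typedef struct {
    atomic_int locked;
} amo_lock_t;

/* One acquisition attempt: cheap relaxed test, then exchange with acquire. */
static inline bool amo_try_acquire(amo_lock_t *l)
{
    if (atomic_load_explicit(&l->locked, memory_order_relaxed) != 0)
        return false;                       /* already held, do not write */
    return atomic_exchange_explicit(&l->locked, 1,
                                    memory_order_acquire) == 0;
}

/* Release: the store itself carries the release ordering. */
static inline void amo_release(amo_lock_t *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

A store with `memory_order_relaxed` in `amo_release` would correspond to the "simple store" in Phil's question, and would need a separate fence before it to order the critical section ahead of the unlock.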


Krste Asanovic
 

I can't see how we can require a sequence with a PAUSE to have a forward
progress guarantee, so the existing spec is correct.

Software can detect lack of progress by incrementing a counter, and
switch to a loop without the PAUSE.

Krste
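
A minimal sketch of the fallback described above, in which software counts failed attempts and stops issuing PAUSE once a threshold is passed, could look like this in C11. The names `acquire_bounded`, `cpu_pause`, and `PAUSE_RETRY_LIMIT` are hypothetical, the threshold is arbitrary, and the attempt bound exists only so the example terminates. On RISC-V targets `cpu_pause` emits the PAUSE encoding (a FENCE with pred=W and succ=0, 0x0100000F) as raw bytes; elsewhere it is a no-op.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define PAUSE_RETRY_LIMIT 64u   /* hypothetical: failures tolerated before dropping PAUSE */

/* Emit PAUSE on RISC-V; do nothing on other targets. */
static inline void cpu_pause(void)
{
#if defined(__riscv)
    __asm__ volatile(".4byte 0x0100000F" ::: "memory");
#endif
}

/*
 * Try to take the lock at most max_attempts times.
 * The first PAUSE_RETRY_LIMIT failures are followed by a PAUSE; after that
 * the loop spins without PAUSE, so progress no longer depends on it.
 * *pauses_used reports how many PAUSEs were issued.
 */
static bool acquire_bounded(atomic_int *lock, unsigned max_attempts,
                            unsigned *pauses_used)
{
    unsigned failures = 0;
    *pauses_used = 0;
    for (unsigned i = 0; i < max_attempts; i++) {
        if (atomic_load_explicit(lock, memory_order_relaxed) == 0 &&
            atomic_exchange_explicit(lock, 1, memory_order_acquire) == 0)
            return true;
        failures++;
        if (failures <= PAUSE_RETRY_LIMIT) {
            cpu_pause();
            (*pauses_used)++;
        }
        /* Past the limit: fall through and retry immediately. */
    }
    return false;
}
```

With a lock that stays held for 100 attempts, this issues 64 PAUSEs and then spins plainly for the remaining 36.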

On Tue, 9 Feb 2021 22:58:09 -0800, Allen Baum <allen.baum@esperantotech.com> said:
| Could you re-send the code with the LR/SC sequence you want to implement again?
| Phil's example above has an LR, but without an SC, so technically not in an LR/SC sequence.

| If it is inside an LR/SC pair, then it falls into the "unconstrained" LR/SC sequence. 
| This removes some forward progress guarantees, but your use case might make that moot.
| It sounds like loosening up the language to not prohibit it, but just warn of the consequences, may be in
| order.



Paul Donahue
 

What are Zaamo and Zalrsc?  I can imagine what they might be but I don't see any indication in a spec that they exist.  My reading is that the A extension requires implementation of LR, SC, and AMO* and that implementing only a subset would be a non-conforming extension.


Thanks,

-Paul


On Wed, Feb 10, 2021 at 1:46 AM Krste Asanovic <krste@...> wrote:

>>>>> On Fri, 05 Feb 2021 10:00:50 -0800, "Phil McCoy" <pnm@...> said:

| If the MIPS-like heuristic of exiting the PAUSE early when the LR reservation gets cleared becomes
| popular, it would be useful to modify the standard RISC-V AMO code idiom to inform the implementation that
| the software is waiting for the lock variable to change:

| again:
| pause
| acquire_lock: /* Entry point */
| lr.w t1, (a0) // lr.w instead of lw as hint to enable optimized PAUSE handling for the lock variable
| bnez t1, again
| li t0, 1
| amoswap.w.aq t1, t0, (a0)
| bnez t1, again
| /* Critical section */
| amoswap.w.rl x0, x0, (a0) /* Release lock */

A comment on making this the standard idiom is that more
implementations have AMOs (Zaamo) than also have LR/SC (Zalrsc).
LR/SC tends to rely on there being a coherent cache system.

| This loop would spin very slowly until the lock becomes available, but still react quickly when the lock
| is released by another hart.

| Just wondering if this optimization seems interesting to anyone outside of MIPS. If there is interest,
| maybe a non-normative comment recommending this heuristic would be in order.

| One other minor comment - can't the lock be released by a simple store instead of an amoswap to x0?

The amoswap provides release ordering as well as the store.  A simple
store would need a FENCE too.

Krste

| Cheers,
| Phil
|






Sanjay Patel
 

Here is the sequence that uses LR/SC and PAUSE.

 

again:
    pause                       /* exits when LRbit clears or on timer expiry */
acquire_lock:                   /* Entry point */
    lr.w.aq t1, (a0)            /* Set LRbit; .aq keeps the acquire ordering that amoswap.w.aq provided */
    bnez t1, again
    li t0, 1
    sc.w t1, t0, (a0)           /* clear LRbit if lock acquired; was amoswap.w.aq t1,t0,(a0) */
    bnez t1, again
    /* Critical section */
    amoswap.w.rl x0, x0, (a0)   /* Release lock */

 

Technically, the PAUSE is still within the LR/SC sequence, because it always depends on a preceding LR to set the reservation, which I call the LRbit. We have, however, taken your advice and allow the timer (with a higher value) to clear the LRbit if it expires before a snoop clears it. Does this violate forward progress guarantees?

 

Sanjay
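
The exit rule described in this message can be captured in a small behavioral model: PAUSE ends either when the reservation is cleared by a snoop or when a timer expires, and the timer limit is higher while a reservation is active. The C sketch below is purely illustrative and is not a description of any real implementation. The names `model_pause`, `pause_result_t`, and both limit constants are hypothetical, and the cycle counts are arbitrary.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAUSE_LIMIT_NO_RESERVATION   32u    /* hypothetical default timeout, in cycles */
#define PAUSE_LIMIT_WITH_RESERVATION 1024u  /* hypothetical longer timeout when LRbit is set */

typedef enum {
    PAUSE_EXIT_TIMER,                /* bounded event: timer expiry */
    PAUSE_EXIT_RESERVATION_CLEARED   /* reservation cleared, e.g. by a snoop */
} pause_exit_t;

typedef struct {
    pause_exit_t reason;
    uint32_t cycles;                 /* cycles spent in PAUSE */
} pause_result_t;

/*
 * reservation_active: whether an LR reservation was held when PAUSE issued.
 * clear_cycle: cycle at which the reservation would be cleared,
 *              or UINT32_MAX if it is never cleared.
 */
static pause_result_t model_pause(bool reservation_active, uint32_t clear_cycle)
{
    uint32_t limit = reservation_active ? PAUSE_LIMIT_WITH_RESERVATION
                                        : PAUSE_LIMIT_NO_RESERVATION;

    /* Early exit applies only if a reservation was active at issue time. */
    if (reservation_active && clear_cycle < limit)
        return (pause_result_t){ PAUSE_EXIT_RESERVATION_CLEARED, clear_cycle };

    /* Otherwise the timer bounds the wait in every case. */
    return (pause_result_t){ PAUSE_EXIT_TIMER, limit };
}
```

In this model the wait never exceeds the larger limit, so PAUSE cannot stall indefinitely even if the lock is never released. The model does not track the additional detail mentioned above, that timer expiry also clears the LRbit.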

 

From: <tech-unprivileged@...> on behalf of "Allen Baum via lists.riscv.org" <allen.baum=esperantotech.com@...>
Reply-To: "tech-unprivileged@..." <tech-unprivileged@...>, "allen.baum@..." <allen.baum@...>
Date: Tuesday, February 9, 2021 at 10:58 PM
To: Sanjay Patel <spatel@...>
Cc: "krste@..." <krste@...>, "tech-unprivileged@..." <tech-unprivileged@...>, Greg Favor <gfavor@...>
Subject: Re: [EXTERNAL]Re: [RISC-V] [tech-unprivileged] PAUSE for LR/SC

 

Could you re-send the code with the LR/SC sequence you want to implement again?

Phil's example above has an LR, but without an SC, so technically not in an LR/SC sequence.

 

If it is inside an LR/SC pair, then it falls into the "unconstrained" LR/SC sequence. 

This removes some forward progress guarantees, but your use case might make that moot.

It sounds like loosening up the language to not prohibit it, but just warn of the consequences, may be in order.

 

 

 

 

 

On Fri, Feb 5, 2021 at 12:18 PM Sanjay Patel <spatel@...> wrote:

Hello,

 

I'd like to conclude discussions around the PAUSE specification as it relates to LR/SC.

 

In our implementation we will be optimizing LR/SC handling based on the event of clearing the reservation while simultaneously allowing a timer/count value to cause exit from PAUSE on expiry.

 

I request that the PAUSE description be changed. The current description states the following:

 

“Like other FENCE instructions, PAUSE cannot be used within LR/SC sequences”

 

I would like to have that replaced by

 

Options:

  1. An implementation may choose to optimize an LR/SC lock sequence by causing the PAUSE to exit either when a bounded event (such as a timer expiry) occurs, or when an LR reservation that was active at the time the PAUSE was issued subsequently clears. In such an implementation, the PAUSE counter may use a higher time limit when an active LR reservation is present, exploiting the fact that for appropriately constructed code sequences the hart cannot acquire the required lock until its LR reservation clears.
  2. An implementation may choose to optimize an LR/SC lock sequence by causing the PAUSE to exit either when a bounded event (such as a timer expiry) occurs, or when an LR reservation that was active at the time the PAUSE was issued subsequently clears.

 

The first is more verbose and includes a description of an implementation optimization. The second is less verbose and leaves the detail to the individual implementation.
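
To make the two exit conditions concrete, here is a minimal behavioural sketch in C of what option 1 describes. It is a software model only, not a description of any actual hardware. The names `pause_model`, `pause_issue`, and `pause_step` and both limit values are hypothetical; the proposed wording gives no numbers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cycle limits; the proposed wording does not specify values. */
enum { PAUSE_LIMIT_DEFAULT = 64, PAUSE_LIMIT_WITH_RESERVATION = 1024 };

typedef struct {
    bool     reservation_at_issue; /* LR reservation was active when PAUSE issued */
    uint32_t limit;                /* timer limit chosen at issue */
    uint32_t elapsed;              /* cycles spent paused so far */
} pause_model;

/* Start a PAUSE: option 1 allows a higher limit when a reservation is held. */
static pause_model pause_issue(bool reservation_active)
{
    pause_model p = {
        .reservation_at_issue = reservation_active,
        .limit = reservation_active ? PAUSE_LIMIT_WITH_RESERVATION
                                    : PAUSE_LIMIT_DEFAULT,
        .elapsed = 0,
    };
    return p;
}

/* Advance the model by one cycle; returns true when the PAUSE should exit. */
static bool pause_step(pause_model *p, bool reservation_active_now)
{
    assert(p);
    p->elapsed++;
    if (p->reservation_at_issue && !reservation_active_now)
        return true;               /* the reservation held at issue has cleared */
    return p->elapsed >= p->limit; /* bounded event: timer expiry */
}
```

Option 2 corresponds to the same model with a single limit; the timer path keeps the exit bounded in either case.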

 

Or there can be a 3rd choice of whatever the workgroup considers appropriate.

 

Pls let me know your feedback.

 

Sanjay

 

Btw, I’m curious about the constraint placed on the use of FENCE with LR/SC. Is it because waiting for the barrier to clear would result in relatively unbounded delay in exiting PAUSE? Further, I assume this is not a concern for PAUSE which leverages FENCE as a hint, since the successor group is 0 and thus the barrier behavior need not be implemented.

 

 

On 1/15/21, 12:16 PM, "krste@..." <krste@...> wrote:

 

 

    >>>>> On Fri, 15 Jan 2021 17:14:23 +0000, Sanjay Patel <spatel@...> said:

 

    | I am partial towards MIPS PAUSE for LR/SC because it is optimized for the specific event that should clear the reservation, the snoop. Further, the response doesn't have to be "sized" as in the case of the RISC-V PAUSE timer, because it naturally tracks the number of threads contending for the lock; i.e., the MIPS PAUSE handling just works out of the box. I do agree though that it is specific to LR/SC and not applicable to the AMOs. But that's why I wanted it bi-modal, to handle the two cases.

 

    You can implement the LL-bit as a purely microarchitectural
    optimization for your use case.  For example, you can use two
    different timeouts for the case where LL-bit is set or not.  I believe
    this captures almost all of the benefits of the MIPS approach while
    avoiding adding complexity to the RISC-V ISA spec.

 

    | Is it recommended that the RISC-V timer be programmable? If so, is it better to be an M-mode register? The reason I say M-mode is that if the count can be programmed arbitrarily by any thread, then there is the risk of livelocks since a thread that has a short cycle count will poll the lock more frequently. While the timer is not specified directly in the specification, I think it should be, as a possible or recommended implementation.

 

    | Also, how does a software programmer assess the value of the counter? I would assume it is the latency from the core to the point of shared coherency.

    | Other rules may have to be specified for PAUSE. Does an interrupt cause the timer to be cleared? I think it should.

    | (Btw, the concerns I raise could just be a matter of good engineering.)

 

    Software locking code can add an adaptive backoff loop if required,
    with no need to involve either more privileged layers or additional
    hardware (neither of which will understand what application software
    is actually trying to do).
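
As an illustration of the software adaptive backoff described above, here is a minimal sketch in C11. The names `spinlock_t`, `cpu_relax`, `spin_lock`, `spin_trylock`, `spin_unlock` and the `BACKOFF_*` bounds are illustrative assumptions, not anything proposed in this thread. On RISC-V the hint is emitted as the raw PAUSE encoding (0x0100000F, a FENCE with pred=W and succ=0); on other targets it is assumed to be a no-op.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_uint locked; } spinlock_t;  /* 0 = free, 1 = held */

/* Hypothetical backoff bounds, counted in PAUSE hints per failed attempt. */
enum { BACKOFF_MIN = 1, BACKOFF_MAX = 1024 };

static inline void cpu_relax(void)
{
#if defined(__riscv)
    __asm__ volatile (".4byte 0x0100000f" ::: "memory"); /* PAUSE hint */
#endif
}

static bool spin_trylock(spinlock_t *l)
{
    unsigned expected = 0;
    return atomic_compare_exchange_strong_explicit(
        &l->locked, &expected, 1u,
        memory_order_acquire, memory_order_relaxed);
}

static void spin_lock(spinlock_t *l)
{
    unsigned delay = BACKOFF_MIN;
    for (;;) {
        /* Read first so contended waiters do not keep writing the line. */
        if (atomic_load_explicit(&l->locked, memory_order_relaxed) == 0 &&
            spin_trylock(l))
            return;
        for (unsigned i = 0; i < delay; i++)
            cpu_relax();
        if (delay < BACKOFF_MAX)
            delay *= 2;            /* exponential backoff, capped */
    }
}

static void spin_unlock(spinlock_t *l)
{
    /* Sanity check: only a held lock may be released. */
    assert(atomic_load_explicit(&l->locked, memory_order_relaxed) == 1);
    atomic_store_explicit(&l->locked, 0u, memory_order_release);
}
```

The backoff policy lives entirely in application code, so it can be tuned per lock without any programmable timer or privileged register.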

 

    | I'm not clear how the definition I've provided prevents RISC-V PAUSE from being used as a HINT. HINTS should not modify architecture state, and implementations can choose to NOP the instruction. I don't think I've violated either rule.

 

    The MIPS PAUSE mandates the existence of an LL-bit, and has behavior
    that depends on this LL-bit.  I realize that the MIPS spec does not
    guarantee that threads will stop at a PAUSE with LL-bit set, but this
    is allowed by spec and software could rely on it, or more importantly,
    software cannot rely on PAUSE being exited unless LL-bit is cleared.
    This is why it is not really a HINT, even though a particular hardware
    implementation could implement it as a NOP.

 

    | If you can provide some clarity regarding my questions about RISC-V PAUSE then we might be able to make it work as is. I think it is the ambiguity that drives me towards the familiar MIPS PAUSE.

 

    Hope this helps,

    Krste

 

 

    | Sanjay

 

    | On 1/14/21, 8:31 AM, "krste@..." <krste@...> wrote:

 

 

    |     As far as I can tell, the only advantage of the MIPS-style PAUSE over
    |     the RISC-V proposed PAUSE would be to remove the need for a timeout
    |     counter in the microarchitecture.  But in the bimodal form you've
    |     given, you still need the timeout counter, but have also stopped PAUSE
    |     from being a HINT (and added definition of LL-bit to LR/SC).

 

    |     I don't believe the reduction in polling overhead is a significant
    |     factor, as an implementation can set the default PAUSE duration such
    |     that this is sufficiently low.

 

    |     What is the objection to the PAUSE occasionally causing a retry and a
    |     re-PAUSE when timer expires and LL-bit is set?  This will happen
    |     anyway if there are interrupts.

 

    |     Krste

 

 

 

    |||||| On Mon, 11 Jan 2021 18:01:38 +0000, Sanjay Patel <spatel@...> said:

 

    |     | Hi Krste,

    |     | One last attempt to unify the definitions. 😊 This is to avoid creating a custom WAIT-FOR-X, equivalent to MIPS PAUSE.

 

    |     | I’ve reformatted the MIPS code you had included such that the operation with LL(LR)/SC is clear in the context of PAUSE use.

 

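    |     | acquire_lock:
    |     |          ll t0, 0(a0)                  # load-linked: read the lock word
    |     |          bnez t0, acquire_lock_retry   # lock already held: go pause
    |     |          addiu t0, t0, 1               # t0 = 1, the locked value
    |     |          sc t0, 0(a0)                  # store-conditional: t0 = 1 on success, 0 on failure
    |     |          bnez t0, 10f                  # sc succeeded: enter the critical region
    |     |          sync
    |     | acquire_lock_retry:
    |     |          pause                         # wait before retrying
    |     |          b acquire_lock
    |     | 10:
    |     |          # critical region code
    |     | release_lock:
    |     |          sync
    |     |          sw zero, 0(a0)                # release: clear the lock word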

 

    |     | A MIPS implementation will set a hardware “LLbit” on execution of LL. We can implement a bi-modal RISC-V PAUSE that waits on the snoop if LLbit is set. If LLbit is not set, then PAUSE waits on the timer expiry. I do think this is a viable solution as it provides the same guarantee of exit as the timer expiry.

 

    |     | On the other hand, if you think it is architecturally messy to implement the bi-modal behavior, then we can implement a custom WAIT-FOR-X for LR/SC, and transition this to a RISC-V WAIT-FOR-X when included in the RISC-V architecture.

 

    |     | Sanjay

 

    |     | On 1/8/21, 6:42 PM, "krste@..." <krste@...> wrote:

 

    |     |||||| On Thu, 7 Jan 2021 16:52:35 +0000, "Sanjay Patel" <spatel@...> said:

 

    |     |     | Thanks Krste.

 

    |     |     | This is helpful and we can map MIPS PAUSE behavior to RISC-V PAUSE.

 

    |     |     | <snip>

 

    |     |     | Yes, and you can implement the exact same semantics as MIPS PAUSE to
    |     |     |     end PAUSE on LL-bit activity as a microarch optimization, but must
    |     |     |     ensure can't PAUSE indefinitely on RISC-V.

 

    |     |     | <snip>

 

    |     |     | Your sentence also implies we need to implement a counter or similar to ensure an exit, if not on the basis of a snoop event.

 

    |     |     | This isn't clear. If the lock is never cleared for some reason, the counter expiring would cause the code to loop again to check the reservation, but it would not gracefully terminate.

 

    |     |     | So basically I'd like to understand how the stated requirement helps.

 

    |     |     If the RISC-V PAUSE could hang indefinitely on any use, it would be
    |     |     impossible to prove progress of some algorithms, unless there is some
    |     |     environmental mechanism guaranteed to cause PAUSE to resume.  These
    |     |     environmental mechanisms would be hard to define in a common way
    |     |     across all the systems RISC-V is supporting.

 

    |     |     In your particular scenario, where PAUSE has semantics of exiting on
    |     |     access to a specified memory location or set of locations, and the
    |     |     lock protocol is known, you can build confidence you don't need a
    |     |     timeout to ensure progress.

 

    |     |     However, if the PAUSE is in a loop waiting for some condition where it
    |     |     is not easily determined by the core that the condition has been satisfied
    |     |     (e.g., a bit in a device register), then a potentially indefinite
    |     |     PAUSE is unusable.
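
A sketch of the device-register case described above, in C: the loop only makes progress if each pause returns in bounded time, because the core has no way to observe the device setting the bit. The names `wait_for_bit`, `pause_hint`, and `max_polls` are hypothetical; on RISC-V the hint is emitted as the raw PAUSE encoding (0x0100000F), and on other targets it is assumed to be a no-op.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static inline void pause_hint(void)
{
#if defined(__riscv)
    __asm__ volatile (".4byte 0x0100000f" ::: "memory"); /* PAUSE hint */
#endif
}

/*
 * Poll a memory-mapped register until any bit in `mask` is set, pausing
 * between reads. Returns true if the bit was seen, false once `max_polls`
 * reads have been made without success.
 */
static bool wait_for_bit(const volatile uint32_t *reg, uint32_t mask,
                         uint32_t max_polls)
{
    assert(reg);
    for (uint32_t i = 0; i < max_polls; i++) {
        if ((*reg & mask) != 0)   /* volatile: re-read on every iteration */
            return true;
        pause_hint();             /* must return in bounded time */
    }
    return false;
}
```

If `pause_hint` could block until a snoop that never arrives, the re-read of `*reg` would never happen, which is the failure mode being described.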

 

    |     |     I'm not opposed to adding specific "WAIT-FOR-X" instructions in
    |     |     addition to PAUSE, but they can't replace the use of simple PAUSE when
    |     |     waiting for events whose trigger is not available to the core.  The
    |     |     WAIT-FOR instructions will also generally have more complex and
    |     |     system-specific semantics.

 

    |     |     Krste

 

    |     |     | Sanjay