
pi_stress: Add memory barrier to resolve crash

Message ID 20241024162620.349934-1-jwyatt@redhat.com
State: New
Series: pi_stress: Add memory barrier to resolve crash

Commit Message

John B. Wyatt IV Oct. 24, 2024, 4:26 p.m. UTC
This patch also serves as an error report for a crash seen on RHEL 9
and Fedora 40. pi_stress crashes if you run it on nearly all cores,
with one core (2 threads) left free for housekeeping. The crash
usually happens after around 2 hours of pi_stress running. This issue
was reproduced on a RHEL server with 2 sockets of 10 cores/20 threads
(40 threads total) and a Fedora server with 2 sockets of 56 cores/112
threads (224 threads total).

The pthread_barrier_wait should guarantee that only one
thread at a time interacts with the variables below that
if block.

GCC's -O2 optimization rearranges the two increments to above the
wait function calls. This causes a race reported by Helgrind.
You can demonstrate this by commenting out the memory barrier
and compiling with -O0: the crash does not happen with -O0.
Thank you to Valentin Schneider <vschneid@redhat.com> for
suggesting that -O2 was involved.

Add a memory barrier to force GCC to keep the increments of the
variables below the pthread calls. The memory barrier prevented any
crashes across several 24-hour test runs. atomic_thread_fence()
requires C11.

Reported-by: Valgrind's Helgrind tool

Signed-off-by: "John B. Wyatt IV" <jwyatt@redhat.com>
Signed-off-by: "John B. Wyatt IV" <sageofredondo@gmail.com>
---
 src/pi_tests/pi_stress.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

Comments

Crystal Wood Oct. 24, 2024, 8:12 p.m. UTC | #1
On Thu, 2024-10-24 at 12:26 -0400, John B. Wyatt IV wrote:
> @@ -952,12 +953,31 @@ void *high_priority(void *arg)
>  			    ("high_priority[%d]: pthread_barrier_wait(finish): %x", p->id, status);
>  			return NULL;
>  		}
> +
> +
> +		/**
> +		 * The pthread_barrier_wait should guarantee that only one
> +		 * thread at a time interacts with the variables below that
> +		 * if block.

How is that guaranteed?

> +		 *
> +		 * GCC -O2 rearranges the two increments above the wait
> +		 * function calls causing a race issue if you run this
> +		 * near full cores with one core (2 threads) free for
> +		 * housekeeping. This causes a crash at around 2 hour of
> +		 * running. You can prove this by commenting out the barrier
> +		 * and compiling with `-O0`. The crash does not show with
> +		 * -O0.
> +		 *
> +		 * Add a memory barrier to force GCC to increment the variables
> +		 * below the pthread calls. This funcion depends on C11.
> +		 **/
> +		atomic_thread_fence(memory_order_seq_cst);

That's kind of nuts that pthread_barrier_wait() doesn't even act as a
compile barrier (which a simple external function call should do), much
less a proper memory barrier.  In fact I'd go beyond that to call it a
bug, just as if there were a mutex implementation that required users
to do this.

And POSIX agrees:
https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_12

Is it possible that something else is going on here, and it's just
being hidden by changing the timing?  What exactly is the race you're
seeing?  It looks like there should be only one high priority thread
per group, so is it racing with a reader, or watchdog_clear(),
or...? How does the crash happen? What is the helgrind output?

-Crystal
John Ogness Oct. 24, 2024, 9:20 p.m. UTC | #2
Hi John,

On 2024-10-24, "John B. Wyatt IV" <jwyatt@redhat.com> wrote:
> +		/**
> +		 * The pthread_barrier_wait should guarantee that only one
> +		 * thread at a time interacts with the variables below that
> +		 * if block.

The pthread_barrier_wait() does exactly the opposite. It guarantees that
multiple threads continue at the same time. Taking a look at some users
of @finish_barrier, we see this means that low_priority() will read
p->total while high_priority() is performing the non-atomic operation
p->total++.

> +		 *
> +		 * GCC -O2 rearranges the two increments above the wait
> +		 * function calls causing a race issue if you run this
> +		 * near full cores with one core (2 threads) free for
> +		 * housekeeping. This causes a crash at around 2 hour of
> +		 * running. You can prove this by commenting out the barrier
> +		 * and compiling with `-O0`. The crash does not show with
> +		 * -O0.

Turning off optimization is not a proof. Show us the assembly code.

> +		 *
> +		 * Add a memory barrier to force GCC to increment the variables
> +		 * below the pthread calls. This funcion depends on C11.
> +		 **/

What you are talking about are compiler barriers, which force the
compiler to obey instruction ordering. Memory barriers are for
guaranteeing memory ordering between CPUs at _runtime_.

> +		atomic_thread_fence(memory_order_seq_cst);

A single memory barrier makes no sense. Memory barriers must be paired
because you are ordering memory access between multiple CPUs.
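A minimal sketch of what pairing means in C11 (a standalone example, not pi_stress code): a release fence on the producer side pairs with an acquire fence on the consumer side, ordering a plain payload store against a relaxed flag:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int payload;		/* plain data being handed off */
static atomic_int ready;	/* flag the fences are paired around */

static void *producer(void *arg)
{
	(void)arg;
	payload = 42;				/* plain store */
	atomic_thread_fence(memory_order_release);	/* fence #1 */
	atomic_store_explicit(&ready, 1, memory_order_relaxed);
	return NULL;
}

static void *consumer(void *arg)
{
	int *out = arg;

	while (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
		;				/* spin until flagged */
	atomic_thread_fence(memory_order_acquire);	/* fence #2, pairs with #1 */
	*out = payload;				/* guaranteed to see 42 */
	return NULL;
}

int handoff(void)
{
	pthread_t p, c;
	int out = 0;

	payload = 0;
	atomic_store(&ready, 0);
	pthread_create(&c, NULL, consumer, &out);
	pthread_create(&p, NULL, producer, NULL);
	pthread_join(p, NULL);
	pthread_join(c, NULL);
	return out;
}
```

Remove either fence and the handoff is no longer ordered; that is the sense in which a single fence "makes no sense" on its own.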

> +
>  		/* update the group stats */
>  		p->total++;

If you want multiple tasks to be able to modify and read these
variables, then either use atomic operations or use locking. That is
what such mechanisms are for.
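Concretely, the atomics option could look like this (field and function names are made up for illustration; the real struct lives in pi_stress.c):

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the group stats fields. */
struct group_stats {
	atomic_ulong total;	/* was a plain counter updated via p->total++ */
	atomic_ulong watchdog;
};

/* Writer side: atomic read-modify-write, safe against concurrent readers. */
static void update_stats(struct group_stats *p)
{
	atomic_fetch_add(&p->total, 1);
	atomic_fetch_add(&p->watchdog, 1);
}

/* Reader side: an atomic load instead of a racy plain read. */
static unsigned long read_total(struct group_stats *p)
{
	return atomic_load(&p->total);
}
```

A pthread_mutex_t around the plain counters would work just as well; atomics merely avoid taking a lock on this hot path.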

John Ogness
Wander Lairson Costa Oct. 29, 2024, 10:15 p.m. UTC | #3
On Thu, Oct 24, 2024 at 12:26:14PM -0400, John B. Wyatt IV wrote:
> This patch is also an error report seen on RHEL 9 and Fedora 40.
> pi_stress crashes if you run this near full cores with one core
> (2 threads) free for housekeeping. The crash usually happens at
> around 2 hours of pi_stress running. This issue was reproduced
> on a RHEL server with 2 sockets of 10 cores/20 threads (total 40
> threads) and a Fedora server with 2 sockets of 56 cores/112
> threads (total 226 threads).
> 
> The pthread_barrier_wait should guarantee that only one
> thread at a time interacts with the variables below that
> if block.
> 
> GCC -O2 optimization rearranges the two increments above the wait
> function calls. This causes a race issue found by Helgrind.
> You can prove this by commenting out the memory barrier
> and compiling with `-O0`. The crash does not happen with -O0.
> Thank you to Valentin Schneider <vschneid@redhat.com> for
> suggesting about -O2.
> 
> Add a memory barrier to force GCC to increment the variables
> below the pthread calls. The memory barrier prevented any crashes
> for several 24 hours tests. This function depends on C11.
> 
> Reported-by: Valgrind's Helgrind tool
> 
> Signed-off-by: "John B. Wyatt IV" <jwyatt@redhat.com>
> Signed-off-by: "John B. Wyatt IV" <sageofredondo@gmail.com>
> ---
>  src/pi_tests/pi_stress.c | 22 +++++++++++++++++++++-
>  1 file changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/src/pi_tests/pi_stress.c b/src/pi_tests/pi_stress.c
> index 371e906..37023a1 100644
> --- a/src/pi_tests/pi_stress.c
> +++ b/src/pi_tests/pi_stress.c
> @@ -44,6 +44,7 @@
>  #include <stdint.h>
>  #include <inttypes.h>
>  #include <limits.h>
> +#include <stdatomic.h>
>  
>  #include "rt-sched.h"
>  #include "rt-utils.h"
> @@ -952,12 +953,31 @@ void *high_priority(void *arg)
>  			    ("high_priority[%d]: pthread_barrier_wait(finish): %x", p->id, status);
>  			return NULL;
>  		}
> +
> +
> +		/**
> +		 * The pthread_barrier_wait should guarantee that only one
> +		 * thread at a time interacts with the variables below that
> +		 * if block.
> +		 *
> +		 * GCC -O2 rearranges the two increments above the wait
> +		 * function calls causing a race issue if you run this
> +		 * near full cores with one core (2 threads) free for
> +		 * housekeeping. This causes a crash at around 2 hour of
> +		 * running. You can prove this by commenting out the barrier
> +		 * and compiling with `-O0`. The crash does not show with
> +		 * -O0.
> +		 *
> +		 * Add a memory barrier to force GCC to increment the variables
> +		 * below the pthread calls. This funcion depends on C11.
> +		 **/

I agree with Crystal. AFAIU, the compiler should not reorder before
pthread_barrier_wait(). Something odd is going on here. However, I am
not familiar with the code base. What's the repo for this?

Btw, did you look at the generated assembly code?

> +		atomic_thread_fence(memory_order_seq_cst);
> +

If the problem is program order, a simple compiler barrier with acquire
semantics should work:

        atomic_signal_fence(memory_order_acquire);
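For reference, a sketch of the difference between the two fences (standalone example, not from the patch): atomic_signal_fence() constrains only the compiler's reordering within one thread and emits no CPU fence instruction, while atomic_thread_fence() also orders memory with respect to other CPUs:

```c
#include <assert.h>
#include <stdatomic.h>

static int a, b;

void compiler_order_only(void)
{
	a = 1;
	/* Compiler-only barrier: no memory access may be reordered by
	 * the compiler across a seq_cst signal fence, but no fence
	 * instruction is emitted for the CPU. */
	atomic_signal_fence(memory_order_seq_cst);
	b = 1;
}

void cpu_order_too(void)
{
	a = 2;
	/* Full barrier: constrains both the compiler and the CPU
	 * (GCC emits mfence on x86-64 for seq_cst). */
	atomic_thread_fence(memory_order_seq_cst);
	b = 2;
}
```

So if the problem really were compiler reordering alone, the cheaper signal fence would suffice, which is the point being made here.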

>  		/* update the group stats */
>  		p->total++;
>  
>  		/* update the watchdog counter */
>  		p->watchdog++;
> -
>  	}
>  	set_shutdown_flag();
>  
> -- 
> 2.47.0
>
John Kacur Oct. 30, 2024, 10:03 p.m. UTC | #4
On Tue, 29 Oct 2024, Wander Lairson Costa wrote:

> On Thu, Oct 24, 2024 at 12:26:14PM -0400, John B. Wyatt IV wrote:
> > This patch is also an error report seen on RHEL 9 and Fedora 40.
> > pi_stress crashes if you run this near full cores with one core
> > (2 threads) free for housekeeping. The crash usually happens at
> > around 2 hours of pi_stress running. This issue was reproduced
> > on a RHEL server with 2 sockets of 10 cores/20 threads (total 40
> > threads) and a Fedora server with 2 sockets of 56 cores/112
> > threads (total 226 threads).
> > 
> > The pthread_barrier_wait should guarantee that only one
> > thread at a time interacts with the variables below that
> > if block.
> > 
> > GCC -O2 optimization rearranges the two increments above the wait
> > function calls. This causes a race issue found by Helgrind.
> > You can prove this by commenting out the memory barrier
> > and compiling with `-O0`. The crash does not happen with -O0.
> > Thank you to Valentin Schneider <vschneid@redhat.com> for
> > suggesting about -O2.
> > 
> > Add a memory barrier to force GCC to increment the variables
> > below the pthread calls. The memory barrier prevented any crashes
> > for several 24 hours tests. This function depends on C11.
> > 
> > Reported-by: Valgrind's Helgrind tool
> > 
> > Signed-off-by: "John B. Wyatt IV" <jwyatt@redhat.com>
> > Signed-off-by: "John B. Wyatt IV" <sageofredondo@gmail.com>
> > ---
> >  src/pi_tests/pi_stress.c | 22 +++++++++++++++++++++-
> >  1 file changed, 21 insertions(+), 1 deletion(-)
> > 
> > diff --git a/src/pi_tests/pi_stress.c b/src/pi_tests/pi_stress.c
> > index 371e906..37023a1 100644
> > --- a/src/pi_tests/pi_stress.c
> > +++ b/src/pi_tests/pi_stress.c
> > @@ -44,6 +44,7 @@
> >  #include <stdint.h>
> >  #include <inttypes.h>
> >  #include <limits.h>
> > +#include <stdatomic.h>
> >  
> >  #include "rt-sched.h"
> >  #include "rt-utils.h"
> > @@ -952,12 +953,31 @@ void *high_priority(void *arg)
> >  			    ("high_priority[%d]: pthread_barrier_wait(finish): %x", p->id, status);
> >  			return NULL;
> >  		}
> > +
> > +
> > +		/**
> > +		 * The pthread_barrier_wait should guarantee that only one
> > +		 * thread at a time interacts with the variables below that
> > +		 * if block.
> > +		 *
> > +		 * GCC -O2 rearranges the two increments above the wait
> > +		 * function calls causing a race issue if you run this
> > +		 * near full cores with one core (2 threads) free for
> > +		 * housekeeping. This causes a crash at around 2 hour of
> > +		 * running. You can prove this by commenting out the barrier
> > +		 * and compiling with `-O0`. The crash does not show with
> > +		 * -O0.
> > +		 *
> > +		 * Add a memory barrier to force GCC to increment the variables
> > +		 * below the pthread calls. This funcion depends on C11.
> > +		 **/
> 
> I agree with Crystal. AFAIU, the compiler should not reorder before
> pthread_barrier_wait(). Something odd is going on here. However, I am
> not familiar with the code base. What's the repo for this?

This is one of the programs in the rt-tests suite.

Clone one of the following:
git://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git
https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git
https://kernel.googlesource.com/pub/scm/utils/rt-tests/rt-tests.git


> 
> Btw, did you look at the generated assembly code?
> 
> > +		atomic_thread_fence(memory_order_seq_cst);
> > +
> 
> If the problem is program order, a simple compiler barrier with acquire
> semantics should work:
> 
>         atomic_signal_fence(memory_order_acquire);
> 
> >  		/* update the group stats */
> >  		p->total++;
> >  
> >  		/* update the watchdog counter */
> >  		p->watchdog++;
> > -
> >  	}
> >  	set_shutdown_flag();
> >  
> > -- 
> > 2.47.0
> > 
> 
> 
>

Patch

diff --git a/src/pi_tests/pi_stress.c b/src/pi_tests/pi_stress.c
index 371e906..37023a1 100644
--- a/src/pi_tests/pi_stress.c
+++ b/src/pi_tests/pi_stress.c
@@ -44,6 +44,7 @@ 
 #include <stdint.h>
 #include <inttypes.h>
 #include <limits.h>
+#include <stdatomic.h>
 
 #include "rt-sched.h"
 #include "rt-utils.h"
@@ -952,12 +953,31 @@  void *high_priority(void *arg)
 			    ("high_priority[%d]: pthread_barrier_wait(finish): %x", p->id, status);
 			return NULL;
 		}
+
+
+		/**
+		 * The pthread_barrier_wait should guarantee that only one
+		 * thread at a time interacts with the variables below that
+		 * if block.
+		 *
+		 * GCC -O2 rearranges the two increments above the wait
+		 * function calls causing a race issue if you run this
+		 * near full cores with one core (2 threads) free for
+		 * housekeeping. This causes a crash at around 2 hours of
+		 * running. You can prove this by commenting out the barrier
+		 * and compiling with `-O0`. The crash does not show with
+		 * -O0.
+		 *
+		 * Add a memory barrier to force GCC to increment the variables
+		 * below the pthread calls. This function depends on C11.
+		 **/
+		atomic_thread_fence(memory_order_seq_cst);
+
 		/* update the group stats */
 		p->total++;
 
 		/* update the watchdog counter */
 		p->watchdog++;
-
 	}
 	set_shutdown_flag();