[v2] locking/rwbase: Prevent indefinite writer starvation

Message ID	20230117083817.togfwc5cy4g67e5r@techsingularity.net
State	New
Headers	show Return-Path: <linux-rt-users-owner@vger.kernel.org> Date: Tue, 17 Jan 2023 08:38:17 +0000 From: Mel Gorman <mgorman@techsingularity.net> To: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@kernel.org>, Davidlohr Bueso <dave@stgolabs.net>, Sebastian Andrzej Siewior <bigeasy@linutronix.de>, Linux-RT <linux-rt-users@vger.kernel.org>, LKML <linux-kernel@vger.kernel.org> Subject: [PATCH v2] locking/rwbase: Prevent indefinite writer starvation Message-ID: <20230117083817.togfwc5cy4g67e5r@techsingularity.net> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline Precedence: bulk
Series	[v2] locking/rwbase: Prevent indefinite writer starvation \| expand [v2] locking/rwbase: Prevent indefinite writer starvation

Mel Gorman Jan. 17, 2023, 8:38 a.m. UTC

rw_semaphore and rwlock are explicitly unfair to writers in the presense
of readers by design with a PREEMPT_RT configuration. Commit 943f0edb754f
("locking/rt: Add base code for RT rw_semaphore and rwlock") notes;

        The implementation is writer unfair, as it is not feasible to do
        priority inheritance on multiple readers, but experience has shown
        that real-time workloads are not the typical workloads which are
        sensitive to writer starvation.

While atypical, it's also trivial to block writers with PREEMPT_RT
indefinitely without ever making forward progress. Since LTP-20220121,
the dio_truncate test case went from having 1 reader to having 16 readers
and the number of readers is sufficient to prevent the down_write ever
succeeding while readers exist. Eventually the test is killed after 30
minutes as a failure.

dio_truncate is not a realtime application but indefinite writer starvation
is undesirable. The test case has one writer appending and truncating files
A and B while multiple readers read file A. The readers and writer are
contending for one file's inode lock which never succeeds as the readers
keep reading until the writer is done which never happens.

This patch records a timestamp when the first writer is blocked if no
deadline or realtime task has recently acquired the lock for read. If
dt/rt tasks are involved, then reader bias is preserved. For other tasks,
reader bias is allowed until the first writer has been blocked for a minimum
of 4ms or 1 tick. The cutoff time is arbitrary on the assumption that a
normal application contending for 4ms also does not need PREEMPT_RT. On
a test machine, the test completed in 88 seconds.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/rwbase_rt.h  |  3 ++
 kernel/locking/rwbase_rt.c | 84 +++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 82 insertions(+), 5 deletions(-)

Mel Gorman Jan. 17, 2023, 4:50 p.m. UTC | #1

On Tue, Jan 17, 2023 at 03:22:30PM +0100, Sebastian Andrzej Siewior wrote:
> On 2023-01-17 08:38:17 [+0000], Mel Gorman wrote:
> > rw_semaphore and rwlock are explicitly unfair to writers in the presense
> > of readers by design with a PREEMPT_RT configuration. Commit 943f0edb754f
> > ("locking/rt: Add base code for RT rw_semaphore and rwlock") notes;
> > 
> >         The implementation is writer unfair, as it is not feasible to do
> >         priority inheritance on multiple readers, but experience has shown
> >         that real-time workloads are not the typical workloads which are
> >         sensitive to writer starvation.
> > 
> > While atypical, it's also trivial to block writers with PREEMPT_RT
> > indefinitely without ever making forward progress. Since LTP-20220121,
> > the dio_truncate test case went from having 1 reader to having 16 readers
> > and the number of readers is sufficient to prevent the down_write ever
> > succeeding while readers exist. Eventually the test is killed after 30
> > minutes as a failure.
> > 
> > dio_truncate is not a realtime application but indefinite writer starvation
> 
> If so then the PI boosting would not work if we would have it ;)
> 

True, but it's still undesirable for a basic functional test using normal
tasks with no prioritisation to stall forever.

> > is undesirable. The test case has one writer appending and truncating files
> > A and B while multiple readers read file A. The readers and writer are
> > contending for one file's inode lock which never succeeds as the readers
> > keep reading until the writer is done which never happens.
> 
> This tests the implementation of rwsem/ rwlock functionality to ensure
> that it is not writer unfair.
> 

Also true but in this case at least, the test's tasks do not care about
priority inheritance.

> > This patch records a timestamp when the first writer is blocked if no
> > deadline or realtime task has recently acquired the lock for read. If
> > dt/rt tasks are involved, then reader bias is preserved. For other tasks,
> DL/ RT. Would it work to use the capital letters if it refers to the
> scheduling class?
> 

Sure.

> > reader bias is allowed until the first writer has been blocked for a minimum
> > of 4ms or 1 tick. The cutoff time is arbitrary on the assumption that a
> > normal application contending for 4ms also does not need PREEMPT_RT. On
> > a test machine, the test completed in 88 seconds.
> 
> I would go for one second just because it _usually_ does not matter
> since none of the important locks rely on that (as stated in the commit
> message). But then why not use the 4ms/ 1 tick as you suggest. This is
> after all what the NON-PREEMPT_RT implementation is using to ensure that
> the writer is not stalled infinitely. The RWLOCK implementation is
> already writer unfair.
> 

1 second is a relatively long time. The test would eventually complete
but not before the 30 minute timeout hits and kills the test anyway.
It makes more sense to explicitly make it similar to RWSEM_WAIT_TIMEOUT.

> Side note: If the test case gets updated to RT reader which acquire the
> lock (the whole time) then they will block writer again :)
> 

True, but in that case a RT-specific test showed that the lock is
writer unfair and no one is surprised. It's the test design equivalent
of "stop hitting yourself" until the lock is writer fair assuming that
happens.

> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > ---
> >  include/linux/rwbase_rt.h  |  3 ++
> >  kernel/locking/rwbase_rt.c | 84 +++++++++++++++++++++++++++++++++++++++++++---
> >  2 files changed, 82 insertions(+), 5 deletions(-)
> > 
> > diff --git a/include/linux/rwbase_rt.h b/include/linux/rwbase_rt.h
> > index 1d264dd08625..05c4dc74b8bd 100644
> > --- a/include/linux/rwbase_rt.h
> > +++ b/include/linux/rwbase_rt.h
> > @@ -10,12 +10,14 @@
> >  
> >  struct rwbase_rt {
> >  	atomic_t		readers;
> > +	unsigned long		waiter_blocked;
> >  	struct rt_mutex_base	rtmutex;
> >  };
> >  
> >  #define __RWBASE_INITIALIZER(name)				\
> >  {								\
> >  	.readers = ATOMIC_INIT(READER_BIAS),			\
> > +	.waiter_blocked = 0,					\
> >  	.rtmutex = __RT_MUTEX_BASE_INITIALIZER(name.rtmutex),	\
> >  }
> >  
> > @@ -23,6 +25,7 @@ struct rwbase_rt {
> >  	do {							\
> >  		rt_mutex_base_init(&(rwbase)->rtmutex);		\
> >  		atomic_set(&(rwbase)->readers, READER_BIAS);	\
> > +		(rwbase)->waiter_blocked = 0;			\
> >  	} while (0)
> >  
> >  
> > diff --git a/kernel/locking/rwbase_rt.c b/kernel/locking/rwbase_rt.c
> > index c201aadb9301..db2f6accf49f 100644
> > --- a/kernel/locking/rwbase_rt.c
> > +++ b/kernel/locking/rwbase_rt.c
> > @@ -39,7 +39,11 @@
> >   * major surgery for a very dubious value.
> >   *
> >   * The risk of writer starvation is there, but the pathological use cases
> > - * which trigger it are not necessarily the typical RT workloads.
> > + * which trigger it are not necessarily the typical RT workloads. The worst
> > + * case of indefinite starvation of a writer will force readers into the
> > + * slow path if a writer is blocked for more than RW_CONTENTION_THRESHOLD
> > + * jiffies unless dl/rt tasks have taken a read lock since the last write
>
> DL/RT please.
> 

Done.

> > + * unlock.
> >   *
> >   * Fast-path orderings:
> >   * The lock/unlock of readers can run in fast paths: lock and unlock are only
> > @@ -65,6 +69,61 @@ static __always_inline int rwbase_read_trylock(struct rwbase_rt *rwb)
> >  	return 0;
> >  }
> >  
> > +/*
> > + * Allow reader bias with a pending writer for a minimum of 4ms or 1 tick.
> 
>     * This matches RWSEM_WAIT_TIMEOUT for the generic RWSEM
>     * implementation.
> 

Done, but I also renamed the threshold to RWBASE_RT_WAIT_TIMEOUT.

> > + * The granularity is not exact as the lowest bit in rwbase_rt->waiter_blocked
> > + * is used to detect recent rt/dl tasks taking a read lock.
> > + */
> > +#define RW_CONTENTION_THRESHOLD (HZ/250+1)
> 				   DIV_ROUND_UP(HZ, 250)
> 
> > +static void __sched update_dlrt_reader(struct rwbase_rt *rwb)
> > +{
> > +	/* No update required if dl/rt tasks already identified. */
> > +	if (rwb->waiter_blocked & 1)
> > +		return;
> > +
> > +	/*
> > +	 * Record a dl/rt task acquiring the lock for read. This may result
> DL/RT
> > +	 * in indefinite writer starvation but dl/rt tasks should avoid such
> > +	 * behaviour.
> > +	 */
> > +	if (dl_task(current) || rt_task(current)) {
> 
> There is also task_is_realtime(). But using only rt_task() should work
> since it also covers dl_task().

True, changed to rt_task only. 

> 
> > +		struct rt_mutex_base *rtm = &rwb->rtmutex;
> > +		unsigned long flags;
> > +
> > +		raw_spin_lock_irqsave(&rtm->wait_lock, flags);
> > +		rwb->waiter_blocked |= 1;
> > +		raw_spin_unlock_irqrestore(&rtm->wait_lock, flags);
> > +	}
> > +}
> > +
> > +/* rtmutex->wait_lock must be held. */
> > +static void __sched set_writer_blocked(struct rwbase_rt *rwb)
> > +{
> > +	/*
> > +	 * Lowest bit preserved to identify recent rt/dl tasks acquiring
> > +	 * the lock for read so guarantee at least one tick delay.
> > +	 */
> > +	rwb->waiter_blocked |= (jiffies + 2) & ~1UL;
> 
> I'm unsure what |= means in terms of multiple writers. It seems to
> extend the wait period and the second writer has none after the first
> one leaves.
> 

That's an oversight as v1 only covered the first writer. The unguarded
check turns into garbage for multiple writers.

> > +}
> > +
> > +static bool __sched rwbase_allow_reader_bias(struct rwbase_rt *rwb)
> > +{
> > +	/*
> > +	 * Allow reader bias if a dl or rt task took the lock for read
> > +	 * since the last write unlock. Such tasks should be designed
> > +	 * to avoid heavy writer contention or indefinite starvation.
> > +	 */
> > +	if (rwb->waiter_blocked & 1)
> > +		return true;
> > +
> > +	/*
> > +	 * Allow reader bias unless a writer has been blocked for more
> > +	 * than RW_CONTENTION_THRESHOLD jiffies.
> > +	 */
> > +	return jiffies - rwb->waiter_blocked < RW_CONTENTION_THRESHOLD;
> 
> if you set
> 	rwb->waiter_blocked = jiffies + RW_CONTENTION_THRESHOLD
> 
> then you could use
> 	time_after(jiffies, waiter->waiter_blocked)
> 
> and we could name it timeout. So the first writer sets it and my guess
> would be that the each RT reader ignores this delay while every non-RT
> tries to acquire the lock unless as long as the timeout did not occur.
> Then they back off and wait for one writer to acquire the lock.
> I don't know what we do with the possible second writer but I guess
> first writer on unlock should reset the timeout for the next writer. So
> we have again reader followed by writer.
> 

The second writer was an oversight. I think it's better to keep it simple
and let the first writer blocked determine when reader bias finishes.
The timeout resets when the write lock is released.

Thanks Sebastian. Based on your feedback, v3 looks like this

--8<--
locking/rwbase: Prevent indefinite writer starvation

rw_semaphore and rwlock are explicitly unfair to writers in the presense
of readers by design with a PREEMPT_RT configuration. Commit 943f0edb754f
("locking/rt: Add base code for RT rw_semaphore and rwlock") notes;

        The implementation is writer unfair, as it is not feasible to do
        priority inheritance on multiple readers, but experience has shown
        that real-time workloads are not the typical workloads which are
        sensitive to writer starvation.

While atypical, it's also trivial to block writers with PREEMPT_RT
indefinitely without ever making forward progress. Since LTP-20220121,
the dio_truncate test case went from having 1 reader to having 16 readers
and the number of readers is sufficient to prevent the down_write ever
succeeding while readers exist. Eventually the test is killed after 30
minutes as a failure.

dio_truncate is not a realtime application but indefinite writer starvation
is undesirable. The test case has one writer appending and truncating files
A and B while multiple readers read file A. The readers and writer are
contending for one file's inode lock which never succeeds as the readers
keep reading until the writer is done which never happens.

This patch records a timestamp when the first writer is blocked if no
deadline or realtime task has recently acquired the lock for read. If DT / RT
tasks are involved, then reader bias priority inheritance is preserved. For
other tasks, reader bias is allowed until the first writer has been blocked
for a minimum of 4ms similar to RWSEM_WAIT_TIMEOUT.  This is sufficient
to allow the test case to complete within the 30 minutes timeout.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/rwbase_rt.h  |  3 ++
 kernel/locking/rwbase_rt.c | 87 +++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 85 insertions(+), 5 deletions(-)

diff --git a/include/linux/rwbase_rt.h b/include/linux/rwbase_rt.h
index 1d264dd08625..b969b1d9bb85 100644
--- a/include/linux/rwbase_rt.h
+++ b/include/linux/rwbase_rt.h
@@ -10,12 +10,14 @@
 
 struct rwbase_rt {
 	atomic_t		readers;
+	unsigned long		waiter_timeout;
 	struct rt_mutex_base	rtmutex;
 };
 
 #define __RWBASE_INITIALIZER(name)				\
 {								\
 	.readers = ATOMIC_INIT(READER_BIAS),			\
+	.waiter_timeout = 0,					\
 	.rtmutex = __RT_MUTEX_BASE_INITIALIZER(name.rtmutex),	\
 }
 
@@ -23,6 +25,7 @@ struct rwbase_rt {
 	do {							\
 		rt_mutex_base_init(&(rwbase)->rtmutex);		\
 		atomic_set(&(rwbase)->readers, READER_BIAS);	\
+		(rwbase)->waiter_timeout = 0;			\
 	} while (0)
 
 
diff --git a/kernel/locking/rwbase_rt.c b/kernel/locking/rwbase_rt.c
index c201aadb9301..99d81e8d1f25 100644
--- a/kernel/locking/rwbase_rt.c
+++ b/kernel/locking/rwbase_rt.c
@@ -39,7 +39,11 @@
  * major surgery for a very dubious value.
  *
  * The risk of writer starvation is there, but the pathological use cases
- * which trigger it are not necessarily the typical RT workloads.
+ * which trigger it are not necessarily the typical RT workloads. The worst
+ * case of indefinite starvation of a writer will force readers into the
+ * slow path if a writer is blocked for more than RWBASE_RT_WAIT_TIMEOUT
+ * jiffies unless DL / RT tasks have taken a read lock since the last write
+ * unlock.
  *
  * Fast-path orderings:
  * The lock/unlock of readers can run in fast paths: lock and unlock are only
@@ -65,6 +69,64 @@ static __always_inline int rwbase_read_trylock(struct rwbase_rt *rwb)
 	return 0;
 }
 
+/*
+ * Allow reader bias with a pending writer for a minimum of 4ms or 1 tick.
+ * This matches RWSEM_WAIT_TIMEOUT for the generic RWSEM implementation.
+ * The granularity is not exact as the lowest bit in rwbase_rt->waiter_timeout
+ * is used to detect recent DL / RT tasks taking a read lock.
+ */
+#define RWBASE_RT_WAIT_TIMEOUT DIV_ROUND_UP(HZ, 250)
+
+static void __sched update_dlrt_reader(struct rwbase_rt *rwb)
+{
+	/* No update required if DL / RT tasks already identified. */
+	if (rwb->waiter_timeout & 1)
+		return;
+
+	/*
+	 * Record a DL / RT task acquiring the lock for read. This may result
+	 * in indefinite writer starvation but DL / RT tasks should avoid such
+	 * behaviour.
+	 */
+	if (rt_task(current)) {
+		struct rt_mutex_base *rtm = &rwb->rtmutex;
+		unsigned long flags;
+
+		raw_spin_lock_irqsave(&rtm->wait_lock, flags);
+		rwb->waiter_timeout |= 1;
+		raw_spin_unlock_irqrestore(&rtm->wait_lock, flags);
+	}
+}
+
+/* rtmutex->wait_lock must be held. */
+static void __sched set_writer_blocked(struct rwbase_rt *rwb)
+{
+	/*
+	 * Record the timeout based on the the first writer to block. The
+	 * lowest bit is used to identify recent DL / RT tasks acquiring the
+	 * lock for read so guarantee at least one tick delay.
+	 */
+	if (rwb->waiter_timeout <= 1) {
+		unsigned long timeout = jiffies + RWBASE_RT_WAIT_TIMEOUT + 2;
+
+		rwb->waiter_timeout |= timeout & ~1UL;
+	}
+}
+
+static bool __sched rwbase_allow_reader_bias(struct rwbase_rt *rwb)
+{
+	/*
+	 * Allow reader bias if a DL / RT task took the lock for read
+	 * since the last write unlock. Such tasks should be designed
+	 * to avoid heavy writer contention or indefinite starvation.
+	 */
+	if (rwb->waiter_timeout & 1)
+		return true;
+
+	/* Allow reader bias unless a writer timeout is reached. */
+	return time_before(jiffies, rwb->waiter_timeout);
+}
+
 static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
 				      unsigned int state)
 {
@@ -74,9 +136,11 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
 	raw_spin_lock_irq(&rtm->wait_lock);
 	/*
 	 * Allow readers, as long as the writer has not completely
-	 * acquired the semaphore for write.
+	 * acquired the semaphore for write and reader bias is still
+	 * allowed.
 	 */
-	if (atomic_read(&rwb->readers) != WRITER_BIAS) {
+	if (atomic_read(&rwb->readers) != WRITER_BIAS &&
+	    rwbase_allow_reader_bias(rwb)) {
 		atomic_inc(&rwb->readers);
 		raw_spin_unlock_irq(&rtm->wait_lock);
 		return 0;
@@ -140,10 +204,18 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
 static __always_inline int rwbase_read_lock(struct rwbase_rt *rwb,
 					    unsigned int state)
 {
+	int ret;
+
 	if (rwbase_read_trylock(rwb))
-		return 0;
+		ret = 0;
+	else
+		ret = __rwbase_read_lock(rwb, state);
+
+	/* Record if the current task acquiring the lock is a DL / RT task. */
+	if (!ret)
+		update_dlrt_reader(rwb);
 
-	return __rwbase_read_lock(rwb, state);
+	return ret;
 }
 
 static void __sched __rwbase_read_unlock(struct rwbase_rt *rwb,
@@ -264,12 +336,17 @@ static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
 		if (__rwbase_write_trylock(rwb))
 			break;
 
+		/* Record first new read/write contention. */
+		set_writer_blocked(rwb);
+
 		raw_spin_unlock_irqrestore(&rtm->wait_lock, flags);
 		rwbase_schedule();
 		raw_spin_lock_irqsave(&rtm->wait_lock, flags);
 
 		set_current_state(state);
 	}
+
+	rwb->waiter_timeout = 0;
 	rwbase_restore_current_state();
 	trace_contention_end(rwb, 0);

Ingo Molnar Jan. 18, 2023, 10:45 a.m. UTC | #2

* Mel Gorman <mgorman@techsingularity.net> wrote:

> > > dio_truncate is not a realtime application but indefinite writer 
> > > starvation
> > 
> > If so then the PI boosting would not work if we would have it ;)
> > 
> 
> True, but it's still undesirable for a basic functional test using normal 
> tasks with no prioritisation to stall forever.

Agreed.

> +/*
> + * Allow reader bias with a pending writer for a minimum of 4ms or 1 tick.
> + * This matches RWSEM_WAIT_TIMEOUT for the generic RWSEM implementation.
> + * The granularity is not exact as the lowest bit in rwbase_rt->waiter_timeout
> + * is used to detect recent DL / RT tasks taking a read lock.
> + */
> +#define RWBASE_RT_WAIT_TIMEOUT DIV_ROUND_UP(HZ, 250)
> +
> +static void __sched update_dlrt_reader(struct rwbase_rt *rwb)
> +{
> +	/* No update required if DL / RT tasks already identified. */
> +	if (rwb->waiter_timeout & 1)
> +		return;
> +
> +	/*
> +	 * Record a DL / RT task acquiring the lock for read. This may result
> +	 * in indefinite writer starvation but DL / RT tasks should avoid such
> +	 * behaviour.
> +	 */
> +	if (rt_task(current)) {
> +		struct rt_mutex_base *rtm = &rwb->rtmutex;
> +		unsigned long flags;
> +
> +		raw_spin_lock_irqsave(&rtm->wait_lock, flags);
> +		rwb->waiter_timeout |= 1;
> +		raw_spin_unlock_irqrestore(&rtm->wait_lock, flags);
> +	}
> +}

So I'm not sure this should be dependent on the task being an RT task.

Starvation scenarios are bad no matter what scheduling policy is used.

Should be unconditional - and all workloads should live with the new 
behavior.

Thanks,

	Ingo

Sebastian Andrzej Siewior Jan. 18, 2023, 3:25 p.m. UTC | #3

On 2023-01-17 16:50:21 [+0000], Mel Gorman wrote:

> diff --git a/kernel/locking/rwbase_rt.c b/kernel/locking/rwbase_rt.c
> index c201aadb9301..99d81e8d1f25 100644
> --- a/kernel/locking/rwbase_rt.c
> +++ b/kernel/locking/rwbase_rt.c
> @@ -65,6 +69,64 @@ static __always_inline int rwbase_read_trylock(struct rwbase_rt *rwb)
>  	return 0;
>  }
>  
> +/*
> + * Allow reader bias with a pending writer for a minimum of 4ms or 1 tick.
> + * This matches RWSEM_WAIT_TIMEOUT for the generic RWSEM implementation.
> + * The granularity is not exact as the lowest bit in rwbase_rt->waiter_timeout
> + * is used to detect recent DL / RT tasks taking a read lock.
> + */
> +#define RWBASE_RT_WAIT_TIMEOUT DIV_ROUND_UP(HZ, 250)
> +
> +static void __sched update_dlrt_reader(struct rwbase_rt *rwb)
> +{
> +	/* No update required if DL / RT tasks already identified. */
> +	if (rwb->waiter_timeout & 1)
> +		return;
> +
> +	/*
> +	 * Record a DL / RT task acquiring the lock for read. This may result
> +	 * in indefinite writer starvation but DL / RT tasks should avoid such
> +	 * behaviour.
> +	 */
> +	if (rt_task(current)) {
> +		struct rt_mutex_base *rtm = &rwb->rtmutex;
> +		unsigned long flags;
> +
> +		raw_spin_lock_irqsave(&rtm->wait_lock, flags);
> +		rwb->waiter_timeout |= 1;

Let me see of I parsed the whole logic right:

_After_ the RT reader acquired the lock, the lowest bit is set. This may
be immediately if the timeout did not occur yet.
With this flag set, all following reader incl. SCHED_OTHER will acquire
the lock.

If so, then I don't know why this is a good idea.

If _only_ the RT reader is allowed to acquire the lock while the writer
is waiting then it make sense to prefer the RT tasks. (So the check is
on current and not on the lowest bit).
All other (SCHED_OTHER) reader would have to block on the rtmutex after
the timeout. This makes sense to avoid the starvation.

If we drop that "we prefer the RT reader" then it would block on the
RTmutex. It will _still_ be preferred over the writer because it will be
enqueued before the writer in the queue due to its RT priority. The only
downside is that it has to wait until all readers are left.
So by allowing the RT reader to always acquire the lock as long as the
WRITER_BIAS isn't set, we would allow to enter early while the other
reader are still in and after the timeout you would only have RT reader
going in and out. All SCHED_OTHER reader block on the RTmutex.

I think I like this.

> +		raw_spin_unlock_irqrestore(&rtm->wait_lock, flags);
> +	}
> +}
> +

Sebastian

Mel Gorman Jan. 18, 2023, 4 p.m. UTC | #4

On Wed, Jan 18, 2023 at 11:45:45AM +0100, Ingo Molnar wrote:
> 
> > +/*
> > + * Allow reader bias with a pending writer for a minimum of 4ms or 1 tick.
> > + * This matches RWSEM_WAIT_TIMEOUT for the generic RWSEM implementation.
> > + * The granularity is not exact as the lowest bit in rwbase_rt->waiter_timeout
> > + * is used to detect recent DL / RT tasks taking a read lock.
> > + */
> > +#define RWBASE_RT_WAIT_TIMEOUT DIV_ROUND_UP(HZ, 250)
> > +
> > +static void __sched update_dlrt_reader(struct rwbase_rt *rwb)
> > +{
> > +	/* No update required if DL / RT tasks already identified. */
> > +	if (rwb->waiter_timeout & 1)
> > +		return;
> > +
> > +	/*
> > +	 * Record a DL / RT task acquiring the lock for read. This may result
> > +	 * in indefinite writer starvation but DL / RT tasks should avoid such
> > +	 * behaviour.
> > +	 */
> > +	if (rt_task(current)) {
> > +		struct rt_mutex_base *rtm = &rwb->rtmutex;
> > +		unsigned long flags;
> > +
> > +		raw_spin_lock_irqsave(&rtm->wait_lock, flags);
> > +		rwb->waiter_timeout |= 1;
> > +		raw_spin_unlock_irqrestore(&rtm->wait_lock, flags);
> > +	}
> > +}
> 
> So I'm not sure this should be dependent on the task being an RT task.
> 
> Starvation scenarios are bad no matter what scheduling policy is used.
> 
> Should be unconditional - and all workloads should live with the new 
> behavior.
> 

The DL / RT task special casing was based on feedback given here
https://lore.kernel.org/r/Y7wxjBN9bDaZ0BKo@hirez.programming.kicks-ass.net.
Based on that, I assumed that allowing write to blocks readers that
may be depending on priority inheritance is potentially problematic for
applications that likely have been designed with writer-starvation in mind.
The first version of the patch did not care about the scheduling classes
were but I admit there is a non-zero possibility that breaking reader bias
for a writer may break some unknown RT-specific application that relied
on writer starvation for DL / RT tasks.

Mel Gorman Jan. 18, 2023, 5:31 p.m. UTC | #5

On Wed, Jan 18, 2023 at 04:25:57PM +0100, Sebastian Andrzej Siewior wrote:
> On 2023-01-17 16:50:21 [+0000], Mel Gorman wrote:
> 
> > diff --git a/kernel/locking/rwbase_rt.c b/kernel/locking/rwbase_rt.c
> > index c201aadb9301..99d81e8d1f25 100644
> > --- a/kernel/locking/rwbase_rt.c
> > +++ b/kernel/locking/rwbase_rt.c
> > @@ -65,6 +69,64 @@ static __always_inline int rwbase_read_trylock(struct rwbase_rt *rwb)
> >  	return 0;
> >  }
> >  
> > +/*
> > + * Allow reader bias with a pending writer for a minimum of 4ms or 1 tick.
> > + * This matches RWSEM_WAIT_TIMEOUT for the generic RWSEM implementation.
> > + * The granularity is not exact as the lowest bit in rwbase_rt->waiter_timeout
> > + * is used to detect recent DL / RT tasks taking a read lock.
> > + */
> > +#define RWBASE_RT_WAIT_TIMEOUT DIV_ROUND_UP(HZ, 250)
> > +
> > +static void __sched update_dlrt_reader(struct rwbase_rt *rwb)
> > +{
> > +	/* No update required if DL / RT tasks already identified. */
> > +	if (rwb->waiter_timeout & 1)
> > +		return;
> > +
> > +	/*
> > +	 * Record a DL / RT task acquiring the lock for read. This may result
> > +	 * in indefinite writer starvation but DL / RT tasks should avoid such
> > +	 * behaviour.
> > +	 */
> > +	if (rt_task(current)) {
> > +		struct rt_mutex_base *rtm = &rwb->rtmutex;
> > +		unsigned long flags;
> > +
> > +		raw_spin_lock_irqsave(&rtm->wait_lock, flags);
> > +		rwb->waiter_timeout |= 1;
> 
> Let me see of I parsed the whole logic right:
> 
> _After_ the RT reader acquired the lock, the lowest bit is set. This may
> be immediately if the timeout did not occur yet.
> With this flag set, all following reader incl. SCHED_OTHER will acquire
> the lock.
> 

Correct. The intent was to track if there were any DL / RT tasks since
the last write unlock in case there was a mix of SCHED_OTHER and rt_tasks
contending on the same lock with unknown arrival times.

> If so, then I don't know why this is a good idea.
> 
> If _only_ the RT reader is allowed to acquire the lock while the writer
> is waiting then it make sense to prefer the RT tasks. (So the check is
> on current and not on the lowest bit).
> All other (SCHED_OTHER) reader would have to block on the rtmutex after
> the timeout. This makes sense to avoid the starvation.
> 

Ok, fair enough.

> If we drop that "we prefer the RT reader" then it would block on the
> RTmutex. It will _still_ be preferred over the writer because it will be
> enqueued before the writer in the queue due to its RT priority. The only
> downside is that it has to wait until all readers are left.

The writer has to wait until all the readers have left anyway.

> So by allowing the RT reader to always acquire the lock as long as the
> WRITER_BIAS isn't set, we would allow to enter early while the other
> reader are still in and after the timeout you would only have RT reader
> going in and out. All SCHED_OTHER reader block on the RTmutex.
> 
> I think I like this.
> 

If I understand you correctly, the patch becomes this;

--8<--
locking/rwbase: Prevent indefinite writer starvation

rw_semaphore and rwlock are explicitly unfair to writers in the presense
of readers by design with a PREEMPT_RT configuration. Commit 943f0edb754f
("locking/rt: Add base code for RT rw_semaphore and rwlock") notes;

        The implementation is writer unfair, as it is not feasible to do
        priority inheritance on multiple readers, but experience has shown
        that real-time workloads are not the typical workloads which are
        sensitive to writer starvation.

While atypical, it's also trivial to block writers with PREEMPT_RT
indefinitely without ever making forward progress. Since LTP-20220121,
the dio_truncate test case went from having 1 reader to having 16 readers
and the number of readers is sufficient to prevent the down_write ever
succeeding while readers exist. Eventually the test is killed after 30
minutes as a failure.

dio_truncate is not a realtime application but indefinite writer starvation
is undesirable. The test case has one writer appending and truncating files
A and B while multiple readers read file A. The readers and writer are
contending for one file's inode lock which never succeeds as the readers
keep reading until the writer is done which never happens.

This patch records a timestamp when the first writer is blocked. DT /
RT tasks can continue to take the lock for read as long as readers exist
indefinitely. Other readers can acquire the read lock unless a writer
has been blocked for a minimum of 4ms. This is sufficient to allow the
dio_truncate test case to complete within the 30 minutes timeout.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/rwbase_rt.h  |  3 +++
 kernel/locking/rwbase_rt.c | 45 ++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/include/linux/rwbase_rt.h b/include/linux/rwbase_rt.h
index 1d264dd08625..b969b1d9bb85 100644
--- a/include/linux/rwbase_rt.h
+++ b/include/linux/rwbase_rt.h
@@ -10,12 +10,14 @@
 
 struct rwbase_rt {
 	atomic_t		readers;
+	unsigned long		waiter_timeout;
 	struct rt_mutex_base	rtmutex;
 };
 
 #define __RWBASE_INITIALIZER(name)				\
 {								\
 	.readers = ATOMIC_INIT(READER_BIAS),			\
+	.waiter_timeout = 0,					\
 	.rtmutex = __RT_MUTEX_BASE_INITIALIZER(name.rtmutex),	\
 }
 
@@ -23,6 +25,7 @@ struct rwbase_rt {
 	do {							\
 		rt_mutex_base_init(&(rwbase)->rtmutex);		\
 		atomic_set(&(rwbase)->readers, READER_BIAS);	\
+		(rwbase)->waiter_timeout = 0;			\
 	} while (0)
 
 
diff --git a/kernel/locking/rwbase_rt.c b/kernel/locking/rwbase_rt.c
index c201aadb9301..84c5e4e4d25b 100644
--- a/kernel/locking/rwbase_rt.c
+++ b/kernel/locking/rwbase_rt.c
@@ -39,7 +39,10 @@
  * major surgery for a very dubious value.
  *
  * The risk of writer starvation is there, but the pathological use cases
- * which trigger it are not necessarily the typical RT workloads.
+ * which trigger it are not necessarily the typical RT workloads. SCHED_OTHER
+ * reader acquisitions will be forced into the slow path if a writer is
+ * blocked for more than RWBASE_RT_WAIT_TIMEOUT jiffies. New DL / RT readers
+ * can still starve a writer indefinitely.
  *
  * Fast-path orderings:
  * The lock/unlock of readers can run in fast paths: lock and unlock are only
@@ -65,6 +68,35 @@ static __always_inline int rwbase_read_trylock(struct rwbase_rt *rwb)
 	return 0;
 }
 
+/*
+ * Allow reader bias for SCHED_OTHER tasks with a pending writer for a
+ * minimum of 4ms or 1 tick. This matches RWSEM_WAIT_TIMEOUT for the
+ * generic RWSEM implementation.
+ */
+#define RWBASE_RT_WAIT_TIMEOUT DIV_ROUND_UP(HZ, 250)
+
+/* rtmutex->wait_lock must be held. */
+static void __sched set_writer_blocked(struct rwbase_rt *rwb)
+{
+	/* Record the timeout based on the the first writer to block. */
+	if (!rwb->waiter_timeout)
+		rwb->waiter_timeout = jiffies + RWBASE_RT_WAIT_TIMEOUT;
+}
+
+static bool __sched rwbase_allow_reader_bias(struct rwbase_rt *rwb)
+{
+	/*
+	 * Allow reader bias for DL / RT tasks. Such tasks should be
+	 * designed to avoid heavy writer contention or indefinite
+	 * starvation.
+	 */
+	if (rt_task(current))
+		return true;
+
+	/* Allow reader bias unless a writer timeout is reached. */
+	return time_before(jiffies, rwb->waiter_timeout);
+}
+
 static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
 				      unsigned int state)
 {
@@ -74,9 +106,11 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
 	raw_spin_lock_irq(&rtm->wait_lock);
 	/*
 	 * Allow readers, as long as the writer has not completely
-	 * acquired the semaphore for write.
+	 * acquired the semaphore for write and reader bias is still
+	 * allowed.
 	 */
-	if (atomic_read(&rwb->readers) != WRITER_BIAS) {
+	if (atomic_read(&rwb->readers) != WRITER_BIAS &&
+	    rwbase_allow_reader_bias(rwb)) {
 		atomic_inc(&rwb->readers);
 		raw_spin_unlock_irq(&rtm->wait_lock);
 		return 0;
@@ -264,12 +298,17 @@ static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
 		if (__rwbase_write_trylock(rwb))
 			break;
 
+		/* Record first new read/write contention. */
+		set_writer_blocked(rwb);
+
 		raw_spin_unlock_irqrestore(&rtm->wait_lock, flags);
 		rwbase_schedule();
 		raw_spin_lock_irqsave(&rtm->wait_lock, flags);
 
 		set_current_state(state);
 	}
+
+	rwb->waiter_timeout = 0;
 	rwbase_restore_current_state();
 	trace_contention_end(rwb, 0);

Sebastian Andrzej Siewior Jan. 19, 2023, 8:25 a.m. UTC | #6

On 2023-01-18 17:31:30 [+0000], Mel Gorman wrote:
 > If we drop that "we prefer the RT reader" then it would block on the
> > RTmutex. It will _still_ be preferred over the writer because it will be
> > enqueued before the writer in the queue due to its RT priority. The only
> > downside is that it has to wait until all readers are left.
> 
> The writer has to wait until all the readers have left anyway.

I meant the READER in case it has RT priority. It will enqueue itself on
the RTmutex, first in line, and wait until all other READER leave.

> If I understand you correctly, the patch becomes this;

exactly.

> --8<--
…
> This patch records a timestamp when the first writer is blocked. DT /

s/DT/DL

> RT tasks can continue to take the lock for read as long as readers exist
> indefinitely. Other readers can acquire the read lock unless a writer
> has been blocked for a minimum of 4ms. This is sufficient to allow the
> dio_truncate test case to complete within the 30 minutes timeout.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
…
> diff --git a/kernel/locking/rwbase_rt.c b/kernel/locking/rwbase_rt.c
> index c201aadb9301..84c5e4e4d25b 100644
> --- a/kernel/locking/rwbase_rt.c
> +++ b/kernel/locking/rwbase_rt.c
> @@ -74,9 +106,11 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
>  	raw_spin_lock_irq(&rtm->wait_lock);
>  	/*
>  	 * Allow readers, as long as the writer has not completely
> -	 * acquired the semaphore for write.
> +	 * acquired the semaphore for write and reader bias is still
> +	 * allowed.
>  	 */
> -	if (atomic_read(&rwb->readers) != WRITER_BIAS) {
> +	if (atomic_read(&rwb->readers) != WRITER_BIAS &&
> +	    rwbase_allow_reader_bias(rwb)) {
>  		atomic_inc(&rwb->readers);
>  		raw_spin_unlock_irq(&rtm->wait_lock);
>  		return 0;
> @@ -264,12 +298,17 @@ static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
>  		if (__rwbase_write_trylock(rwb))
>  			break;
>  
> +		/* Record first new read/write contention. */
> +		set_writer_blocked(rwb);
> +
>  		raw_spin_unlock_irqrestore(&rtm->wait_lock, flags);
>  		rwbase_schedule();
>  		raw_spin_lock_irqsave(&rtm->wait_lock, flags);
>  
>  		set_current_state(state);
>  	}
> +
> +	rwb->waiter_timeout = 0;

Regarding memory ordering and ordering in general:
- Should the writer leave from rwbase_schedule() due to a signal then
  set_writer_blocked() sets a timeout but it is not cleared on the
  signal leave.

- There is only writer in that for loop within rwbase_write_lock()
  because only one writer can own the rtmutex at a time. (A second
  writer blocks on the RTmutex and needs to wait, I may have spread some
  confusion earler). Therefore it should be okay to unconditionally set
  the timeout (instead of checking for zero).

- Once the writer removes READER_BIAS, it forces the reader into the
  slowpath. At that time the writer does not own the wait_lock meaning
  the reader _could_ check the timeout before writer had a chance to set
  it. The worst thing is probably that if jiffies does not have the
  highest bit set then it will always disable the reader bias here.
  The easiest thing is probably to check timeout vs 0 and ensure on the
  writer side that the lowest bit is always set (in the unlikely case it
  will end up as zero).

>  	rwbase_restore_current_state();
>  	trace_contention_end(rwb, 0);

Sebastian

Mel Gorman Jan. 19, 2023, 11:02 a.m. UTC | #7

On Thu, Jan 19, 2023 at 09:25:56AM +0100, Sebastian Andrzej Siewior wrote:
> On 2023-01-18 17:31:30 [+0000], Mel Gorman wrote:
>  > If we drop that "we prefer the RT reader" then it would block on the
> > > RTmutex. It will _still_ be preferred over the writer because it will be
> > > enqueued before the writer in the queue due to its RT priority. The only
> > > downside is that it has to wait until all readers are left.
> > 
> > The writer has to wait until all the readers have left anyway.
> 
> I meant the READER in case it has RT priority. It will enqueue itself on
> the RTmutex, first in line, and wait until all other READER leave.
> 

Ah.

> > If I understand you correctly, the patch becomes this;
> 
> exactly.
> 
> > --8<--
> ???
> > This patch records a timestamp when the first writer is blocked. DT /
> 
> s/DT/DL
> 

Fixed.

> > RT tasks can continue to take the lock for read as long as readers exist
> > indefinitely. Other readers can acquire the read lock unless a writer
> > has been blocked for a minimum of 4ms. This is sufficient to allow the
> > dio_truncate test case to complete within the 30 minutes timeout.
> > 
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > ---
> ???
> > diff --git a/kernel/locking/rwbase_rt.c b/kernel/locking/rwbase_rt.c
> > index c201aadb9301..84c5e4e4d25b 100644
> > --- a/kernel/locking/rwbase_rt.c
> > +++ b/kernel/locking/rwbase_rt.c
> > @@ -74,9 +106,11 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
> >  	raw_spin_lock_irq(&rtm->wait_lock);
> >  	/*
> >  	 * Allow readers, as long as the writer has not completely
> > -	 * acquired the semaphore for write.
> > +	 * acquired the semaphore for write and reader bias is still
> > +	 * allowed.
> >  	 */
> > -	if (atomic_read(&rwb->readers) != WRITER_BIAS) {
> > +	if (atomic_read(&rwb->readers) != WRITER_BIAS &&
> > +	    rwbase_allow_reader_bias(rwb)) {
> >  		atomic_inc(&rwb->readers);
> >  		raw_spin_unlock_irq(&rtm->wait_lock);
> >  		return 0;
> > @@ -264,12 +298,17 @@ static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
> >  		if (__rwbase_write_trylock(rwb))
> >  			break;
> >  
> > +		/* Record first new read/write contention. */
> > +		set_writer_blocked(rwb);
> > +
> >  		raw_spin_unlock_irqrestore(&rtm->wait_lock, flags);
> >  		rwbase_schedule();
> >  		raw_spin_lock_irqsave(&rtm->wait_lock, flags);
> >  
> >  		set_current_state(state);
> >  	}
> > +
> > +	rwb->waiter_timeout = 0;
> 
> Regarding memory ordering and ordering in general:
> - Should the writer leave from rwbase_schedule() due to a signal then
>   set_writer_blocked() sets a timeout but it is not cleared on the
>   signal leave.
> 

You're correct, it should be reset in the signal check block before the
wait_lock is released by __rwbase_write_unlock. I considered different
ways to avoid multiple reset points but it was untidy.

> - There is only writer in that for loop within rwbase_write_lock()
>   because only one writer can own the rtmutex at a time. (A second
>   writer blocks on the RTmutex and needs to wait, I may have spread some
>   confusion earler). Therefore it should be okay to unconditionally set
>   the timeout (instead of checking for zero).
> 

Ah ok, I see now or at least I think I do.

> - Once the writer removes READER_BIAS, it forces the reader into the
>   slowpath.

Removed in __rwbase_write_trylock IIUC

>   At that time the writer does not own the wait_lock meaning
>   the reader _could_ check the timeout before writer had a chance to set
>   it. The worst thing is probably that if jiffies does not have the
>   highest bit set then it will always disable the reader bias here.
>   The easiest thing is probably to check timeout vs 0 and ensure on the
>   writer side that the lowest bit is always set (in the unlikely case it
>   will end up as zero).
> 

I am missing something important. On the read side, we have

        raw_spin_lock_irq(&rtm->wait_lock);
        /*
         * Allow readers, as long as the writer has not completely
         * acquired the semaphore for write and reader bias is still
         * allowed.
         */
        if (atomic_read(&rwb->readers) != WRITER_BIAS &&
            rwbase_allow_reader_bias(rwb)) {
                atomic_inc(&rwb->readers);
                raw_spin_unlock_irq(&rtm->wait_lock);
                return 0;
        }

So rtm->wait_lock is held and both the WRITER_BASE and timeout are checked
under the lock. On the write side it's also held when the timeout is
updated here

        raw_spin_lock_irqsave(&rtm->wait_lock, flags);
	...
	for (;;) {
		...

                /* Record timeout when reader bias is ignored. */
                rwb->waiter_timeout = jiffies + RWBASE_RT_WAIT_TIMEOUT;

                raw_spin_unlock_irqrestore(&rtm->wait_lock, flags);
                rwbase_schedule();
                raw_spin_lock_irqsave(&rtm->wait_lock, flags);

                set_current_state(state);
        }

        rwb->waiter_timeout = 0;
	...

out_unlock:
        raw_spin_unlock_irqrestore(&rtm->wait_lock, flags);

(it's also now held in the signal block when it is cleared)

I'm not seeing exactly what the problem is unless it's an issue in the
fast path but I think the atomic_try_cmpxchg_acquire works there.

In the absense of wait_lock, I guess there would be a potential
race between the atomic_read(&rwb->readers) != WRITER_BIAS and
rwbase_allow_reader_bias(rwb) but that shouldn't be the case now. Even
if it wasn't locked, it might allow a new reader through but it'd be
a transient problem and writer starvation should be prevented once the
timeout is observed.

Sebastian Andrzej Siewior Jan. 19, 2023, 4:28 p.m. UTC | #8

On 2023-01-19 11:02:20 [+0000], Mel Gorman wrote:
> > - Once the writer removes READER_BIAS, it forces the reader into the
> >   slowpath.
> 
> Removed in __rwbase_write_trylock IIUC

And added back in case try trylock failed via __rwbase_write_unlock().
The RTmutex is unlocked and the READER_BIAS is "returned".

> >   At that time the writer does not own the wait_lock meaning
> >   the reader _could_ check the timeout before writer had a chance to set
> >   it. The worst thing is probably that if jiffies does not have the
> >   highest bit set then it will always disable the reader bias here.
> >   The easiest thing is probably to check timeout vs 0 and ensure on the
> >   writer side that the lowest bit is always set (in the unlikely case it
> >   will end up as zero).
> > 
> 
> I am missing something important. On the read side, we have
> 

Look at this side by side:

                writer                                                       reader

| static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
|                                      unsigned int state)
| {
|         /* Force readers into slow path */
|         atomic_sub(READER_BIAS, &rwb->readers);


|                                                               static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
|                                                                                                     unsigned int state)
|                                                               {       
|                                                                       struct rt_mutex_base *rtm = &rwb->rtmutex;
|                                                                       int ret;                         
|                                                               
|                                                                       raw_spin_lock_irq(&rtm->wait_lock);

Reader has the lock, writer will wait.

|                                                                       /*
|                                                                        * Allow readers, as long as the writer has not completely
|                                                                        * acquired the semaphore for write.
|                                                                        */
|                                                                       if (atomic_read(&rwb->readers) != WRITER_BIAS) {

here, the timeout value is not yet populated by the writer so the reader
compares vs 0.

|                                                                               atomic_inc(&rwb->readers);
|                                                                               raw_spin_unlock_irq(&rtm->wait_lock);
|                                                                               return 0;
|                                                                       }
|                                                              

|         raw_spin_lock_irqsave(&rtm->wait_lock, flags);
|         if (__rwbase_write_trylock(rwb))
|                 goto out_unlock;
|

Hope this makes it easier.

Sebastian

Mel Gorman Jan. 19, 2023, 5:41 p.m. UTC | #9

On Thu, Jan 19, 2023 at 05:28:48PM +0100, Sebastian Andrzej Siewior wrote:
> On 2023-01-19 11:02:20 [+0000], Mel Gorman wrote:
> > > - Once the writer removes READER_BIAS, it forces the reader into the
> > >   slowpath.
> > 
> > Removed in __rwbase_write_trylock IIUC
> 
> And added back in case try trylock failed via __rwbase_write_unlock().
> The RTmutex is unlocked and the READER_BIAS is "returned".
> 

Indeed.

> > >   At that time the writer does not own the wait_lock meaning
> > >   the reader _could_ check the timeout before writer had a chance to set
> > >   it. The worst thing is probably that if jiffies does not have the
> > >   highest bit set then it will always disable the reader bias here.
> > >   The easiest thing is probably to check timeout vs 0 and ensure on the
> > >   writer side that the lowest bit is always set (in the unlikely case it
> > >   will end up as zero).
> > > 
> > 
> > I am missing something important. On the read side, we have
> > 
> 
> Look at this side by side:
> 
>                 writer                                                       reader
> 
> | static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
> |                                      unsigned int state)
> | {
> |         /* Force readers into slow path */
> |         atomic_sub(READER_BIAS, &rwb->readers);
> 
> 
> |                                                               static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
> |                                                                                                     unsigned int state)
> |                                                               {       
> |                                                                       struct rt_mutex_base *rtm = &rwb->rtmutex;
> |                                                                       int ret;                         
> |                                                               
> |                                                                       raw_spin_lock_irq(&rtm->wait_lock);
> 
> Reader has the lock, writer will wait.
> 
> |                                                                       /*
> |                                                                        * Allow readers, as long as the writer has not completely
> |                                                                        * acquired the semaphore for write.
> |                                                                        */
> |                                                                       if (atomic_read(&rwb->readers) != WRITER_BIAS) {
> 
> here, the timeout value is not yet populated by the writer so the reader
> compares vs 0.
> 
> |                                                                               atomic_inc(&rwb->readers);
> |                                                                               raw_spin_unlock_irq(&rtm->wait_lock);
> |                                                                               return 0;
> |                                                                       }
> |                                                              
> 
> |         raw_spin_lock_irqsave(&rtm->wait_lock, flags);
> |         if (__rwbase_write_trylock(rwb))
> |                 goto out_unlock;
> |
> 
> Hope this makes it easier.
> 

Yes, it makes your concern much clearer but I'm not sure it actually matters
in terms of preventing write starvation or in terms of correctness. At
worst, a writer is blocked that could have acquired the lock during a tiny
race but that's a timing issue rather than a correctness issue.

Lets say the race hits

									reader sees waiter_timeout == 0
	writer acquires wait_lock
	__rwbase_write_trylock fails
	update waiter_timeout
	rwbase_schedule

Each reader that hits the race goes ahead at a point in time but anything
readers after that observe the timeout and eventually the writer goes ahead.

If the waiter_timeout was updated before atomic_sub(READER_BIAS),
it doesn't close the race as atomic_sub is unordered so barriers would
also be needed and clearing of waiter_timeout moves to out_unlock in case
__rwbase_write_trylock succeeds. That's possible but the need for barriers
makes it more complicated than is necessary.

The race could be closed by moving wait_lock acquisition before the
atomic_sub in rwbase_write_lock() but it expands the scope of the wait_lock
and I'm not sure that's necessary for either correctness or preventing
writer starvation. It's a more straight-forward fix but expanding the
scope of a lock unnecessarily has been unpopular in the past.

I think we can close the race that concerns you but I'm not convinced we
need to and changing the scope of wait_lock would need a big comment and
probably deserves a separate patch.

Sorry if I'm still missing something stupid and thanks for your patience
reviewing this.

Davidlohr Bueso Jan. 19, 2023, 5:48 p.m. UTC | #10

On Thu, 19 Jan 2023, Mel Gorman wrote:

>The race could be closed by moving wait_lock acquisition before the
>atomic_sub in rwbase_write_lock() but it expands the scope of the wait_lock
>and I'm not sure that's necessary for either correctness or preventing
>writer starvation. It's a more straight-forward fix but expanding the
>scope of a lock unnecessarily has been unpopular in the past.

Curiously, this is the documented behavior:

  * down_write/write_lock()
  *  1) Lock rtmutex
  *  2) Remove the reader BIAS to force readers into the slow path

Davidlohr Bueso Jan. 19, 2023, 5:58 p.m. UTC | #11

On Thu, 19 Jan 2023, Davidlohr Bueso wrote:

>On Thu, 19 Jan 2023, Mel Gorman wrote:
>
>>The race could be closed by moving wait_lock acquisition before the
>>atomic_sub in rwbase_write_lock() but it expands the scope of the wait_lock
>>and I'm not sure that's necessary for either correctness or preventing
>>writer starvation. It's a more straight-forward fix but expanding the
>>scope of a lock unnecessarily has been unpopular in the past.
>
>Curiously, this is the documented behavior:
>
> * down_write/write_lock()
> *  1) Lock rtmutex
> *  2) Remove the reader BIAS to force readers into the slow path

Nevermind, this was the rtmutex, not the wait_lock, sorry for the noise.

Sebastian Andrzej Siewior Jan. 20, 2023, 8:25 a.m. UTC | #12

On 2023-01-19 17:41:01 [+0000], Mel Gorman wrote:
> 
> Yes, it makes your concern much clearer but I'm not sure it actually matters
> in terms of preventing write starvation or in terms of correctness. At
> worst, a writer is blocked that could have acquired the lock during a tiny
> race but that's a timing issue rather than a correctness issue.

Correct. My concern is that one reader may need to wait 4ms+ for the
lock while a following reader (that one that sees the timeout) does not.
This can lead to confusion later on.

> Lets say the race hits
> 
> 									reader sees waiter_timeout == 0
> 	writer acquires wait_lock
> 	__rwbase_write_trylock fails
> 	update waiter_timeout
> 	rwbase_schedule
> 
> Each reader that hits the race goes ahead at a point in time but anything
> readers after that observe the timeout and eventually the writer goes ahead.
> 
> If the waiter_timeout was updated before atomic_sub(READER_BIAS),
> it doesn't close the race as atomic_sub is unordered so barriers would
> also be needed and clearing of waiter_timeout moves to out_unlock in case
> __rwbase_write_trylock succeeds. That's possible but the need for barriers
> makes it more complicated than is necessary.

yes...

> The race could be closed by moving wait_lock acquisition before the
> atomic_sub in rwbase_write_lock() but it expands the scope of the wait_lock
> and I'm not sure that's necessary for either correctness or preventing
> writer starvation. It's a more straight-forward fix but expanding the
> scope of a lock unnecessarily has been unpopular in the past.
> 
> I think we can close the race that concerns you but I'm not convinced we
> need to and changing the scope of wait_lock would need a big comment and
> probably deserves a separate patch.

would it work to check the timeout vs 0 before and only apply the
timeout check if it is != zero? The writer would need to unconditionally
or the lowest bit. That should close gaps at a low price. The timeout
variable is always read within the lock so there shouldn't be need for
any additional barriers.

> Sorry if I'm still missing something stupid and thanks for your patience
> reviewing this.
thank that it is patience and not pain in the ass ;)

Sebastian

Mel Gorman Jan. 20, 2023, 1:24 p.m. UTC | #13

On Fri, Jan 20, 2023 at 09:25:00AM +0100, Sebastian Andrzej Siewior wrote:
> On 2023-01-19 17:41:01 [+0000], Mel Gorman wrote:
> > 
> > Yes, it makes your concern much clearer but I'm not sure it actually matters
> > in terms of preventing write starvation or in terms of correctness. At
> > worst, a writer is blocked that could have acquired the lock during a tiny
> > race but that's a timing issue rather than a correctness issue.
> 
> Correct. My concern is that one reader may need to wait 4ms+ for the
> lock while a following reader (that one that sees the timeout) does not.
> This can lead to confusion later on.
> 

Ok, yes, that is a valid concern I had not considered when thinking in
terms of correctness or writer starvation. It would be very tricky to
diagnose if it happened.

> > The race could be closed by moving wait_lock acquisition before the
> > atomic_sub in rwbase_write_lock() but it expands the scope of the wait_lock
> > and I'm not sure that's necessary for either correctness or preventing
> > writer starvation. It's a more straight-forward fix but expanding the
> > scope of a lock unnecessarily has been unpopular in the past.
> > 
> > I think we can close the race that concerns you but I'm not convinced we
> > need to and changing the scope of wait_lock would need a big comment and
> > probably deserves a separate patch.
> 
> would it work to check the timeout vs 0 before and only apply the
> timeout check if it is != zero? The writer would need to unconditionally
> or the lowest bit. That should close gaps at a low price. The timeout
> variable is always read within the lock so there shouldn't be need for
> any additional barriers.
> 

Yes, as a bonus point, it can be checked early in rwbase_allow_reader_bias
and is an cheap test for the common case so it's win-win all round.

Patch is now this;

--8<--
locking/rwbase: Prevent indefinite writer starvation

rw_semaphore and rwlock are explicitly unfair to writers in the presense
of readers by design with a PREEMPT_RT configuration. Commit 943f0edb754f
("locking/rt: Add base code for RT rw_semaphore and rwlock") notes;

        The implementation is writer unfair, as it is not feasible to do
        priority inheritance on multiple readers, but experience has shown
        that real-time workloads are not the typical workloads which are
        sensitive to writer starvation.

While atypical, it's also trivial to block writers with PREEMPT_RT
indefinitely without ever making forward progress. Since LTP-20220121,
the dio_truncate test case went from having 1 reader to having 16 readers
and the number of readers is sufficient to prevent the down_write ever
succeeding while readers exist. Eventually the test is killed after 30
minutes as a failure.

dio_truncate is not a realtime application but indefinite writer starvation
is undesirable. The test case has one writer appending and truncating files
A and B while multiple readers read file A. The readers and writer are
contending for one file's inode lock which never succeeds as the readers
keep reading until the writer is done which never happens.

This patch records a timestamp when the first writer is blocked. DL /
RT tasks can continue to take the lock for read as long as readers exist
indefinitely. Other readers can acquire the read lock unless a writer
has been blocked for a minimum of 4ms. This is sufficient to allow the
dio_truncate test case to complete within the 30 minutes timeout.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/rwbase_rt.h  |  3 +++
 kernel/locking/rwbase_rt.c | 38 +++++++++++++++++++++++++++++++++++---
 2 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/include/linux/rwbase_rt.h b/include/linux/rwbase_rt.h
index 1d264dd08625..b969b1d9bb85 100644
--- a/include/linux/rwbase_rt.h
+++ b/include/linux/rwbase_rt.h
@@ -10,12 +10,14 @@
 
 struct rwbase_rt {
 	atomic_t		readers;
+	unsigned long		waiter_timeout;
 	struct rt_mutex_base	rtmutex;
 };
 
 #define __RWBASE_INITIALIZER(name)				\
 {								\
 	.readers = ATOMIC_INIT(READER_BIAS),			\
+	.waiter_timeout = 0,					\
 	.rtmutex = __RT_MUTEX_BASE_INITIALIZER(name.rtmutex),	\
 }
 
@@ -23,6 +25,7 @@ struct rwbase_rt {
 	do {							\
 		rt_mutex_base_init(&(rwbase)->rtmutex);		\
 		atomic_set(&(rwbase)->readers, READER_BIAS);	\
+		(rwbase)->waiter_timeout = 0;			\
 	} while (0)
 
 
diff --git a/kernel/locking/rwbase_rt.c b/kernel/locking/rwbase_rt.c
index c201aadb9301..9d5bbf2985de 100644
--- a/kernel/locking/rwbase_rt.c
+++ b/kernel/locking/rwbase_rt.c
@@ -39,7 +39,10 @@
  * major surgery for a very dubious value.
  *
  * The risk of writer starvation is there, but the pathological use cases
- * which trigger it are not necessarily the typical RT workloads.
+ * which trigger it are not necessarily the typical RT workloads. SCHED_OTHER
+ * reader acquisitions will be forced into the slow path if a writer is
+ * blocked for more than RWBASE_RT_WAIT_TIMEOUT jiffies. New DL / RT readers
+ * can still starve a writer indefinitely.
  *
  * Fast-path orderings:
  * The lock/unlock of readers can run in fast paths: lock and unlock are only
@@ -65,6 +68,27 @@ static __always_inline int rwbase_read_trylock(struct rwbase_rt *rwb)
 	return 0;
 }
 
+/*
+ * Allow reader bias for SCHED_OTHER tasks with a pending writer for a
+ * minimum of 4ms or 1 tick. This matches RWSEM_WAIT_TIMEOUT for the
+ * generic RWSEM implementation.
+ */
+#define RWBASE_RT_WAIT_TIMEOUT DIV_ROUND_UP(HZ, 250)
+
+static bool __sched rwbase_allow_reader_bias(struct rwbase_rt *rwb)
+{
+	/*
+	 * Allow reader bias if no writer is blocked or for DL / RT tasks.
+	 * Such tasks should be designed to avoid heavy writer contention
+	 * or indefinite starvation.
+	 */
+	if (!rwb->waiter_timeout || rt_task(current))
+		return true;
+
+	/* Allow reader bias unless a writer timeout has expired. */
+	return time_before(jiffies, rwb->waiter_timeout);
+}
+
 static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
 				      unsigned int state)
 {
@@ -74,9 +98,11 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
 	raw_spin_lock_irq(&rtm->wait_lock);
 	/*
 	 * Allow readers, as long as the writer has not completely
-	 * acquired the semaphore for write.
+	 * acquired the semaphore for write and reader bias is still
+	 * allowed.
 	 */
-	if (atomic_read(&rwb->readers) != WRITER_BIAS) {
+	if (atomic_read(&rwb->readers) != WRITER_BIAS &&
+	    rwbase_allow_reader_bias(rwb)) {
 		atomic_inc(&rwb->readers);
 		raw_spin_unlock_irq(&rtm->wait_lock);
 		return 0;
@@ -255,6 +281,7 @@ static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
 	for (;;) {
 		/* Optimized out for rwlocks */
 		if (rwbase_signal_pending_state(state, current)) {
+			rwb->waiter_timeout = 0;
 			rwbase_restore_current_state();
 			__rwbase_write_unlock(rwb, 0, flags);
 			trace_contention_end(rwb, -EINTR);
@@ -264,12 +291,17 @@ static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
 		if (__rwbase_write_trylock(rwb))
 			break;
 
+		/* Record timeout when reader bias is ignored. */
+		rwb->waiter_timeout = jiffies + RWBASE_RT_WAIT_TIMEOUT;
+
 		raw_spin_unlock_irqrestore(&rtm->wait_lock, flags);
 		rwbase_schedule();
 		raw_spin_lock_irqsave(&rtm->wait_lock, flags);
 
 		set_current_state(state);
 	}
+
+	rwb->waiter_timeout = 0;
 	rwbase_restore_current_state();
 	trace_contention_end(rwb, 0);

Sebastian Andrzej Siewior Jan. 20, 2023, 1:38 p.m. UTC | #14

On 2023-01-20 13:24:41 [+0000], Mel Gorman wrote:
> --- a/kernel/locking/rwbase_rt.c
> +++ b/kernel/locking/rwbase_rt.c
> @@ -264,12 +291,17 @@ static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
>  		if (__rwbase_write_trylock(rwb))
>  			break;
>  
> +		/* Record timeout when reader bias is ignored. */
> +		rwb->waiter_timeout = jiffies + RWBASE_RT_WAIT_TIMEOUT;
		rwb->waiter_timeout = (jiffies + RWBASE_RT_WAIT_TIMEOUT) | 1;

There is the unlikely case that (jiffies + RWBASE_RT_WAIT_TIMEOUT) = 0
on 32bit where it is not jiffies64.

Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

> +
>  		raw_spin_unlock_irqrestore(&rtm->wait_lock, flags);
>  		rwbase_schedule();
>  		raw_spin_lock_irqsave(&rtm->wait_lock, flags);
>  
>  		set_current_state(state);
>  	}
> +
> +	rwb->waiter_timeout = 0;
>  	rwbase_restore_current_state();
>  	trace_contention_end(rwb, 0);

Sebastian

Mel Gorman Jan. 20, 2023, 2:07 p.m. UTC | #15

On Fri, Jan 20, 2023 at 02:38:27PM +0100, Sebastian Andrzej Siewior wrote:
> On 2023-01-20 13:24:41 [+0000], Mel Gorman wrote:
> > --- a/kernel/locking/rwbase_rt.c
> > +++ b/kernel/locking/rwbase_rt.c
> > @@ -264,12 +291,17 @@ static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
> >  		if (__rwbase_write_trylock(rwb))
> >  			break;
> >  
> > +		/* Record timeout when reader bias is ignored. */
> > +		rwb->waiter_timeout = jiffies + RWBASE_RT_WAIT_TIMEOUT;
> 		rwb->waiter_timeout = (jiffies + RWBASE_RT_WAIT_TIMEOUT) | 1;
> 
> There is the unlikely case that (jiffies + RWBASE_RT_WAIT_TIMEOUT) = 0
> on 32bit where it is not jiffies64.
> 
> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> 

Thanks very much, updated version will be posted shortly.

Davidlohr Bueso Jan. 20, 2023, 3:36 p.m. UTC | #16

On Fri, 20 Jan 2023, Mel Gorman wrote:

>locking/rwbase: Prevent indefinite writer starvation
>
>rw_semaphore and rwlock are explicitly unfair to writers in the presense
>of readers by design with a PREEMPT_RT configuration. Commit 943f0edb754f
>("locking/rt: Add base code for RT rw_semaphore and rwlock") notes;
>
>        The implementation is writer unfair, as it is not feasible to do
>        priority inheritance on multiple readers, but experience has shown
>        that real-time workloads are not the typical workloads which are
>        sensitive to writer starvation.
>
>While atypical, it's also trivial to block writers with PREEMPT_RT
>indefinitely without ever making forward progress. Since LTP-20220121,
>the dio_truncate test case went from having 1 reader to having 16 readers
>and the number of readers is sufficient to prevent the down_write ever
>succeeding while readers exist. Eventually the test is killed after 30
>minutes as a failure.
>
>dio_truncate is not a realtime application but indefinite writer starvation
>is undesirable. The test case has one writer appending and truncating files
>A and B while multiple readers read file A. The readers and writer are
>contending for one file's inode lock which never succeeds as the readers
>keep reading until the writer is done which never happens.
>
>This patch records a timestamp when the first writer is blocked. DL /
>RT tasks can continue to take the lock for read as long as readers exist
>indefinitely. Other readers can acquire the read lock unless a writer
>has been blocked for a minimum of 4ms. This is sufficient to allow the
>dio_truncate test case to complete within the 30 minutes timeout.
>
>Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

LGTM (albeit Sebastian's last comment).

Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>

[v2] locking/rwbase: Prevent indefinite writer starvation

Commit Message

Comments

Patch