[RFC,0/2] RT scheduling policies for workqueues

Message ID	20220323145600.2156689-1-linux@rasmusvillemoes.dk
Headers	show Return-Path: <linux-rt-users-owner@kernel.org> From: Rasmus Villemoes <linux@rasmusvillemoes.dk> To: Tejun Heo <tj@kernel.org>, Lai Jiangshan <jiangshanlai@gmail.com>, linux-kernel@vger.kernel.org Cc: =?utf-8?q?Andr=C3=A9_Pribil?= <a.pribil@beck-ipc.com>, Steven Walter <stevenrwalter@gmail.com>, Oleksij Rempel <o.rempel@pengutronix.de>, Esben Haabendal <esben@geanix.com>, Jiri Slaby <jirislaby@kernel.org>, Pengutronix Kernel Team <kernel@pengutronix.de>, Peter Hurley <peter@hurleysoftware.com>, linux-rt-users@vger.kernel.org, Rasmus Villemoes <linux@rasmusvillemoes.dk> Subject: [RFC PATCH 0/2] RT scheduling policies for workqueues Date: Wed, 23 Mar 2022 15:55:58 +0100 Message-Id: <20220323145600.2156689-1-linux@rasmusvillemoes.dk> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	RT scheduling policies for workqueues \| expand [RFC,0/2] RT scheduling policies for workqueues [RFC,1/2] workqueue: allow use of realtime scheduling policies [RFC,2/2] workqueue: update sysfs handlers, allow setting RT policies

Rasmus Villemoes March 23, 2022, 2:55 p.m. UTC

This RFC is motivated by an old problem in the tty layer. Ever since
commit a9c3f68f3cd8 (tty: Fix low_latency BUG), use of UART for
real-time applications has been problematic. Even if both the
application itself and the irq thread are set to SCHED_FIFO, the fact
that the flush_to_ldisc work is scheduled on the generic and global
system_unbound_wq (with all workers running at normal scheduling
priority) means that UART RX can suffer unbounded latency.

This is far from the first time this has come up [1] [2] [3], but so
far no solution acceptable by mainline has been found.

I have been playing around with a solution that would (depending on a
tty.rx_worker module parameter) create a kthread_worker per port,
embed a 'struct kthread_work' alongside the 'struct work_struct' in
struct tty_bufhead, and then choose between the current queue_work()
and kthread_queue_work() based on whether buf->kworker is NULL or
not. That works, but is a bit cumbersome for the application, since it
needs to traverse all of /proc to find the appropriate kthread and set
its priority after opening the tty. It's also not a very pretty
solution.

So, taking a cue from something Linus said in the [2] thread, this is
an attempt at keeping the use of workqueues. A WQ_HIGHPRI, aka nice
-19, workqueue is not enough for RT guarantees. So this extends struct
workqueue_attrs by an 'int policy', allowing userspace via the sysfs
interface to choose, say, SCHED_FIFO 37, for a given workqueue.

This is obviously not for merging as-is, but it is working code that
I've tested would solve the problem we're having (with the few rather
trivial changes in the tty layer, which are not really worth spelling
out here). But before polishing this, I'd like to know if anything
like this has any chance of making it to mainline.



[1] https://lore.kernel.org/linux-rt-users/20180718164958.l3f4ajloobdkp5tz@linutronix.de/T/
[2] https://lore.kernel.org/lkml/20190110101232.9398-1-o.rempel@pengutronix.de/
[3] https://www.spinics.net/lists/linux-serial/msg17782.html


Rasmus Villemoes (2):
  workqueue: allow use of realtime scheduling policies
  workqueue: update sysfs handlers, allow setting RT policies

 include/linux/workqueue.h | 17 +++++++--
 kernel/workqueue.c        | 74 ++++++++++++++++++++++++++++++++++-----
 2 files changed, 80 insertions(+), 11 deletions(-)

Sebastian Andrzej Siewior March 28, 2022, 10:05 a.m. UTC | #1

On 2022-03-23 15:55:58 [+0100], Rasmus Villemoes wrote:
> This RFC is motivated by an old problem in the tty layer. Ever since
> commit a9c3f68f3cd8 (tty: Fix low_latency BUG), use of UART for
> real-time applications has been problematic. Even if both the
> application itself and the irq thread are set to SCHED_FIFO, the fact
> that the flush_to_ldisc work is scheduled on the generic and global
> system_unbound_wq (with all workers running at normal scheduling
> priority) means that UART RX can suffer unbounded latency.

Having a kthread per "low-latency" tty instance is something I would
prefer. The kwork corner is an anonymous worker instance and probably
does more harm than good. Especially if it is a knob for everyone which
is used for the wrong reasons and manages to be harmful in the end.
With a special kthread for a particular tty, the thread can be assigned
with the desired priority within the system and ttyS1 can be
distinguished from ttyS0 (and so on). This turned out to be useful in a
few setups over the years.

Sebastian

Marc Kleine-Budde March 28, 2022, 10:09 a.m. UTC | #2

On 28.03.2022 12:05:57, Sebastian Andrzej Siewior wrote:
> On 2022-03-23 15:55:58 [+0100], Rasmus Villemoes wrote:
> > This RFC is motivated by an old problem in the tty layer. Ever since
> > commit a9c3f68f3cd8 (tty: Fix low_latency BUG), use of UART for
> > real-time applications has been problematic. Even if both the
> > application itself and the irq thread are set to SCHED_FIFO, the fact
> > that the flush_to_ldisc work is scheduled on the generic and global
> > system_unbound_wq (with all workers running at normal scheduling
> > priority) means that UART RX can suffer unbounded latency.
> 
> Having a kthread per "low-latency" tty instance is something I would
> prefer. The kwork corner is an anonymous worker instance and probably
> does more harm than good. Especially if it is a knob for everyone which
> is used for the wrong reasons and manages to be harmful in the end.
> With a special kthread for a particular tty, the thread can be assigned
> with the desired priority within the system and ttyS1 can be
> distinguished from ttyS0 (and so on). This turned out to be useful in a
> few setups over the years.

+1

The networking subsystem has gone the same/similar way with NAPI. NAPI
handling can be switched from the softirq to kernel thread on a per
interface basis.

regards
Marc

Tejun Heo March 28, 2022, 5:39 p.m. UTC | #3

Hello,

On Mon, Mar 28, 2022 at 12:09:27PM +0200, Marc Kleine-Budde wrote:
> > Having a kthread per "low-latency" tty instance is something I would
> > prefer. The kwork corner is an anonymous worker instance and probably
> > does more harm than good. Especially if it is a knob for everyone which
> > is used for the wrong reasons and manages to be harmful in the end.
> > With a special kthread for a particular tty, the thread can be assigned
> > with the desired priority within the system and ttyS1 can be
> > distinguished from ttyS0 (and so on). This turned out to be useful in a
> > few setups over the years.
> 
> +1
> 
> The networking subsystem has gone the same/similar way with NAPI. NAPI
> handling can be switched from the softirq to kernel thread on a per
> interface basis.

I wonder whether it'd be useful to provide a set of wrappers which can make
switching between workqueue and kworker easy. Semantics-wise, they're
already mostly aligned and it shouldn't be too difficult to e.g. make an
unbounded workqueue be backed by a dedicated kthread_worker instead of
shared pool depending on a flag, or even allow switching dynamically.

Thanks.

Marc Kleine-Budde March 28, 2022, 6:07 p.m. UTC | #4

On 28.03.2022 07:39:25, Tejun Heo wrote:
> Hello,
> 
> On Mon, Mar 28, 2022 at 12:09:27PM +0200, Marc Kleine-Budde wrote:
> > > Having a kthread per "low-latency" tty instance is something I would
> > > prefer. The kwork corner is an anonymous worker instance and probably
> > > does more harm than good. Especially if it is a knob for everyone which
> > > is used for the wrong reasons and manages to be harmful in the end.
> > > With a special kthread for a particular tty, the thread can be assigned
> > > with the desired priority within the system and ttyS1 can be
> > > distinguished from ttyS0 (and so on). This turned out to be useful in a
> > > few setups over the years.
> > 
> > +1
> > 
> > The networking subsystem has gone the same/similar way with NAPI. NAPI
> > handling can be switched from the softirq to kernel thread on a per
> > interface basis.
> 
> I wonder whether it'd be useful to provide a set of wrappers which can make
> switching between workqueue and kworker easy. Semantics-wise, they're
> already mostly aligned and it shouldn't be too difficult to e.g. make an
> unbounded workqueue be backed by a dedicated kthread_worker instead of
> shared pool depending on a flag, or even allow switching dynamically.

For NAPI a sysfs entry was added to switch to threaded mode:

| 5fdd2f0e5c64 net: add sysfs attribute to control napi threaded mode
| 29863d41bb6e net: implement threaded-able napi poll loop support
| 898f8015ffe7 net: extract napi poll functionality to __napi_poll()

regards,
Marc

Sebastian Andrzej Siewior March 29, 2022, 6:30 a.m. UTC | #5

On 2022-03-28 07:39:25 [-1000], Tejun Heo wrote:
> Hello,
Hi,

> I wonder whether it'd be useful to provide a set of wrappers which can make
> switching between workqueue and kworker easy. Semantics-wise, they're
> already mostly aligned and it shouldn't be too difficult to e.g. make an
> unbounded workqueue be backed by a dedicated kthread_worker instead of
> shared pool depending on a flag, or even allow switching dynamically.

This could work. For the tty layer it could use 'lowlatency' attribute
to decide which implementation makes sense.

> Thanks.
> 

Sebastian

Rasmus Villemoes March 29, 2022, 8:33 a.m. UTC | #6

On 29/03/2022 08.30, Sebastian Andrzej Siewior wrote:
> On 2022-03-28 07:39:25 [-1000], Tejun Heo wrote:
>> Hello,
> Hi,
> 
>> I wonder whether it'd be useful to provide a set of wrappers which can make
>> switching between workqueue and kworker easy. Semantics-wise, they're
>> already mostly aligned and it shouldn't be too difficult to e.g. make an
>> unbounded workqueue be backed by a dedicated kthread_worker instead of
>> shared pool depending on a flag, or even allow switching dynamically.

Well, that would certainly not make it any easier for userspace to
discover the thread it needs to chrt().

> This could work. For the tty layer it could use 'lowlatency' attribute
> to decide which implementation makes sense.

I have patches that merely touch the tty layer, but tying it to the
lowlatency attribute is quite painful (which has also come up in
previous discussions on this) - because the lowlatency flag can be
flipped from userspace, but synchronizing which variant is used and
switching dynamically is at least beyond my skills to make work
robustly. So in my patches, the choice is made at open() time. However,
I'm still not convinced code like

 struct tty_bufhead {
        struct tty_buffer *head;        /* Queue head */
        struct work_struct work;
+       struct kthread_work kwork;
+       struct kthread_worker *kworker;

 bool tty_buffer_restart_work(struct tty_port *port)
 {
-       return queue_work(system_unbound_wq, &port->buf.work);
+       struct tty_bufhead *buf = &port->buf;
+
+       if (buf->kworker)
+               return kthread_queue_work(buf->kworker, &buf->kwork);
+       else
+               return queue_work(system_unbound_wq, &buf->work);
 }

etc. is the way to go.

===

Here's another idea: In an ideal world, the irq thread itself [people
caring about latency use threaded interrupts] could just do the work
immediately - then the admin only has one kernel thread to properly
configure. However, as Sebastian pointed out, doing that leads to a
lockdep splat [1], and it also means that there's no work item involved,
so some other thread calling tty_buffer_flush_work() might not actually
wait for a concurrent flush_to_ldisc() to finish. So could we create a
struct hybrid_work { } which, when enqueued, does something like

  bool current_is_irqthread(void) { return in_task() &&
kthread_func(current) == irq_thread; }

hwork_queue(struct hybrid_work *hwork, struct workqueue_struct *wq)
  if (current_is_irqthread()) {
    task_work_add(current, &hwork->twork)
  } else {
    queue_work(wq, &hwork->work);
  }

(with extra bookkeeping so _flush  and _cancel_sync methods can also be
created). It would require irqthread to learn to run its queued
task_works in its main loop, which in turn would require finding some
other way to do the irq_thread_dtor() cleanup, but that should be doable.

While the implementation of hybrid_work might be a bit complex, I think
this would have potential for being used in other situations, and for
the users, the API would be as simple as the current workqueue/struct
kwork APIs. By letting the irq thread do more/all of the work, we'd
probably also win some latency due to fewer threads involved and better
cache locality. And the admin/BSP is already setting the rt priorities
of the [irq/...] threads.

Rasmus

[1]
https://lore.kernel.org/linux-rt-users/20180711080957.f6txdmzrrrrdm7ig@linutronix.de/

Sebastian Andrzej Siewior April 1, 2022, 9:21 a.m. UTC | #7

On 2022-03-29 10:33:19 [+0200], Rasmus Villemoes wrote:
> On 29/03/2022 08.30, Sebastian Andrzej Siewior wrote:
> > On 2022-03-28 07:39:25 [-1000], Tejun Heo wrote:
> >> Hello,
> > Hi,
> > 
> >> I wonder whether it'd be useful to provide a set of wrappers which can make
> >> switching between workqueue and kworker easy. Semantics-wise, they're
> >> already mostly aligned and it shouldn't be too difficult to e.g. make an
> >> unbounded workqueue be backed by a dedicated kthread_worker instead of
> >> shared pool depending on a flag, or even allow switching dynamically.
> 
> Well, that would certainly not make it any easier for userspace to
> discover the thread it needs to chrt().

It should be configured within the tty-layer and not making a working RT
just because it is possible.

> > This could work. For the tty layer it could use 'lowlatency' attribute
> > to decide which implementation makes sense.
> 
> I have patches that merely touch the tty layer, but tying it to the
> lowlatency attribute is quite painful (which has also come up in
> previous discussions on this) - because the lowlatency flag can be
> flipped from userspace, but synchronizing which variant is used and
> switching dynamically is at least beyond my skills to make work
> robustly. So in my patches, the choice is made at open() time. However,
> I'm still not convinced code like
> 
>  struct tty_bufhead {
>         struct tty_buffer *head;        /* Queue head */
>         struct work_struct work;
> +       struct kthread_work kwork;
> +       struct kthread_worker *kworker;
> 
> 
>  bool tty_buffer_restart_work(struct tty_port *port)
>  {
> -       return queue_work(system_unbound_wq, &port->buf.work);
> +       struct tty_bufhead *buf = &port->buf;
> +
> +       if (buf->kworker)
> +               return kthread_queue_work(buf->kworker, &buf->kwork);
> +       else
> +               return queue_work(system_unbound_wq, &buf->work);
>  }
> 
> etc. is the way to go.
> 
> ===
> 
> Here's another idea: In an ideal world, the irq thread itself [people
> caring about latency use threaded interrupts] could just do the work
> immediately - then the admin only has one kernel thread to properly
> configure. However, as Sebastian pointed out, doing that leads to a
> lockdep splat [1], and it also means that there's no work item involved,
> so some other thread calling tty_buffer_flush_work() might not actually
> wait for a concurrent flush_to_ldisc() to finish. So could we create a
> struct hybrid_work { } which, when enqueued, does something like
> 
>   bool current_is_irqthread(void) { return in_task() &&
> kthread_func(current) == irq_thread; }
> 
> hwork_queue(struct hybrid_work *hwork, struct workqueue_struct *wq)
>   if (current_is_irqthread()) {
>     task_work_add(current, &hwork->twork)
>   } else {
>     queue_work(wq, &hwork->work);
>   }
> 
> (with extra bookkeeping so _flush  and _cancel_sync methods can also be
> created). It would require irqthread to learn to run its queued
> task_works in its main loop, which in turn would require finding some
> other way to do the irq_thread_dtor() cleanup, but that should be doable.
> 
> While the implementation of hybrid_work might be a bit complex, I think
> this would have potential for being used in other situations, and for
> the users, the API would be as simple as the current workqueue/struct
> kwork APIs. By letting the irq thread do more/all of the work, we'd
> probably also win some latency due to fewer threads involved and better
> cache locality. And the admin/BSP is already setting the rt priorities
> of the [irq/...] threads.

Hmmm. Sounds complicated. Especially the part where irqthread needs to
deal with irq_thread_dtor in another way.
If this is something we want for everyone and not just for the "low
latency" attribute because it seems to make sense for everyone, would it
work to add the data in one step and then flush it once all locks are
dropped? The UART driver could be extended to a threaded handler if it
is not desired/ possible to complete in the primary handler.

> Rasmus

Sebastian

Rasmus Villemoes April 6, 2022, 10 a.m. UTC | #8

On 01/04/2022 11.21, Sebastian Andrzej Siewior wrote:
> On 2022-03-29 10:33:19 [+0200], Rasmus Villemoes wrote:
>> On 29/03/2022 08.30, Sebastian Andrzej Siewior wrote:
>>> On 2022-03-28 07:39:25 [-1000], Tejun Heo wrote:
>>>> Hello,
>>> Hi,
>>>
>>>> I wonder whether it'd be useful to provide a set of wrappers which can make
>>>> switching between workqueue and kworker easy. Semantics-wise, they're
>>>> already mostly aligned and it shouldn't be too difficult to e.g. make an
>>>> unbounded workqueue be backed by a dedicated kthread_worker instead of
>>>> shared pool depending on a flag, or even allow switching dynamically.
>>
>> Well, that would certainly not make it any easier for userspace to
>> discover the thread it needs to chrt().
> 
> It should be configured within the tty-layer and not making a working RT
> just because it is possible.

I'm sorry, I can't parse that sentence.

The tty-layer cannot possibly set the right RT priorities, only the
application/userspace/the BSP developer knows what is right. The kernel
has rightly standardized on just the two sched_set_fifo and
sched_set_fifo_low; the admin must configure the system, but that also
requires that the admin has access to knobs to actually do that.
>>
>> Here's another idea: In an ideal world, the irq thread itself [people
>> caring about latency use threaded interrupts] could just do the work
>> immediately - then the admin only has one kernel thread to properly
>> configure. However, as Sebastian pointed out, doing that leads to a
>> lockdep splat [1], and it also means that there's no work item involved,
>> so some other thread calling tty_buffer_flush_work() might not actually
>> wait for a concurrent flush_to_ldisc() to finish. So could we create a
>> struct hybrid_work { } which, when enqueued, does something like
>>
>>   bool current_is_irqthread(void) { return in_task() &&
>> kthread_func(current) == irq_thread; }
>>
>> hwork_queue(struct hybrid_work *hwork, struct workqueue_struct *wq)
>>   if (current_is_irqthread()) {
>>     task_work_add(current, &hwork->twork)
>>   } else {
>>     queue_work(wq, &hwork->work);
>>   }
>>
>> (with extra bookkeeping so _flush  and _cancel_sync methods can also be
>> created). It would require irqthread to learn to run its queued
>> task_works in its main loop, which in turn would require finding some
>> other way to do the irq_thread_dtor() cleanup, but that should be doable.
>>
>> While the implementation of hybrid_work might be a bit complex, I think
>> this would have potential for being used in other situations, and for
>> the users, the API would be as simple as the current workqueue/struct
>> kwork APIs. By letting the irq thread do more/all of the work, we'd
>> probably also win some latency due to fewer threads involved and better
>> cache locality. And the admin/BSP is already setting the rt priorities
>> of the [irq/...] threads.
> 
> Hmmm. Sounds complicated. Especially the part where irqthread needs to
> deal with irq_thread_dtor in another way.

Well, we wouldn't need to use the task_work mechanism, we could also add
a list_head to struct irqaction {} aka the irq thread's kthread_data().

> If this is something we want for everyone and not just for the "low
> latency" attribute because it seems to make sense for everyone, would it
> work to add the data in one step and then flush it once all locks are
> dropped? The UART driver could be extended to a threaded handler if it
> is not desired/ possible to complete in the primary handler.

Yes, the idea is certainly to create something which is applicable more
generally than just for the tty problem. There are lots of places where
one ends up with a somewhat silly situation in that the driver's irq
handler is carefully written to not do much more than just schedule a
work item, so with the -RT patch set, we wake a task so it can wake a
task so it can ... And it also means that the admin might have carefully
adjusted the rt priority of the irq/foobar kernel thread and the
consuming application, but that doesn't matter when there's some random
SCHED_OTHER task in between - i.e. exactly the tty problem.

I guess I should write some real patches to explain what I mean more
clearly.

Rasmus

[RFC,0/2] RT scheduling policies for workqueues

Message

Comments