Message ID | 20220323145600.2156689-1-linux@rasmusvillemoes.dk |
---|---|
Series | RT scheduling policies for workqueues |

On 2022-03-23 15:55:58 [+0100], Rasmus Villemoes wrote:
> This RFC is motivated by an old problem in the tty layer. Ever since
> commit a9c3f68f3cd8 (tty: Fix low_latency BUG), use of UART for
> real-time applications has been problematic. Even if both the
> application itself and the irq thread are set to SCHED_FIFO, the fact
> that the flush_to_ldisc work is scheduled on the generic and global
> system_unbound_wq (with all workers running at normal scheduling
> priority) means that UART RX can suffer unbounded latency.

Having a kthread per "low-latency" tty instance is something I would
prefer. The kwork corner is an anonymous worker instance and probably
does more harm than good. Especially if it is a knob for everyone which
is used for the wrong reasons and manages to be harmful in the end.
With a special kthread for a particular tty, the thread can be assigned
with the desired priority within the system and ttyS1 can be
distinguished from ttyS0 (and so on). This turned out to be useful in a
few setups over the years.

Sebastian

On 28.03.2022 12:05:57, Sebastian Andrzej Siewior wrote:
> On 2022-03-23 15:55:58 [+0100], Rasmus Villemoes wrote:
> > This RFC is motivated by an old problem in the tty layer. Ever since
> > commit a9c3f68f3cd8 (tty: Fix low_latency BUG), use of UART for
> > real-time applications has been problematic. Even if both the
> > application itself and the irq thread are set to SCHED_FIFO, the fact
> > that the flush_to_ldisc work is scheduled on the generic and global
> > system_unbound_wq (with all workers running at normal scheduling
> > priority) means that UART RX can suffer unbounded latency.
>
> Having a kthread per "low-latency" tty instance is something I would
> prefer. The kwork corner is an anonymous worker instance and probably
> does more harm than good. Especially if it is a knob for everyone which
> is used for the wrong reasons and manages to be harmful in the end.
> With a special kthread for a particular tty, the thread can be assigned
> with the desired priority within the system and ttyS1 can be
> distinguished from ttyS0 (and so on). This turned out to be useful in a
> few setups over the years.

+1

The networking subsystem has gone the same/similar way with NAPI. NAPI
handling can be switched from the softirq to kernel thread on a per
interface basis.

regards
Marc

Hello,

On Mon, Mar 28, 2022 at 12:09:27PM +0200, Marc Kleine-Budde wrote:
> > Having a kthread per "low-latency" tty instance is something I would
> > prefer. The kwork corner is an anonymous worker instance and probably
> > does more harm than good. Especially if it is a knob for everyone which
> > is used for the wrong reasons and manages to be harmful in the end.
> > With a special kthread for a particular tty, the thread can be assigned
> > with the desired priority within the system and ttyS1 can be
> > distinguished from ttyS0 (and so on). This turned out to be useful in a
> > few setups over the years.
>
> +1
>
> The networking subsystem has gone the same/similar way with NAPI. NAPI
> handling can be switched from the softirq to kernel thread on a per
> interface basis.

I wonder whether it'd be useful to provide a set of wrappers which can make
switching between workqueue and kworker easy. Semantics-wise, they're
already mostly aligned and it shouldn't be too difficult to e.g. make an
unbounded workqueue be backed by a dedicated kthread_worker instead of
shared pool depending on a flag, or even allow switching dynamically.

Thanks.

On 28.03.2022 07:39:25, Tejun Heo wrote:
> Hello,
>
> On Mon, Mar 28, 2022 at 12:09:27PM +0200, Marc Kleine-Budde wrote:
> > > Having a kthread per "low-latency" tty instance is something I would
> > > prefer. The kwork corner is an anonymous worker instance and probably
> > > does more harm than good. Especially if it is a knob for everyone which
> > > is used for the wrong reasons and manages to be harmful in the end.
> > > With a special kthread for a particular tty, the thread can be assigned
> > > with the desired priority within the system and ttyS1 can be
> > > distinguished from ttyS0 (and so on). This turned out to be useful in a
> > > few setups over the years.
> >
> > +1
> >
> > The networking subsystem has gone the same/similar way with NAPI. NAPI
> > handling can be switched from the softirq to kernel thread on a per
> > interface basis.
>
> I wonder whether it'd be useful to provide a set of wrappers which can make
> switching between workqueue and kworker easy. Semantics-wise, they're
> already mostly aligned and it shouldn't be too difficult to e.g. make an
> unbounded workqueue be backed by a dedicated kthread_worker instead of
> shared pool depending on a flag, or even allow switching dynamically.

For NAPI a sysfs entry was added to switch to threaded mode:

| 5fdd2f0e5c64 net: add sysfs attribute to control napi threaded mode
| 29863d41bb6e net: implement threaded-able napi poll loop support
| 898f8015ffe7 net: extract napi poll functionality to __napi_poll()

regards,
Marc
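
For readers unfamiliar with the knob referred to above: the first commit in that list exposes a per-device `threaded` attribute in sysfs. The following is a minimal userspace sketch of flipping it, assuming the attribute lives at `/sys/class/net/<iface>/threaded`. The helper names `write_attr` and `napi_set_threaded` are made up for this illustration and are not kernel or libc API, and writing the attribute normally requires root.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical helper: write a short string to a file.
 * Returns 0 on success or a negative errno-style value. */
int write_attr(const char *path, const char *value)
{
	FILE *f;
	int err = 0;

	assert(path && value);
	f = fopen(path, "w");
	if (!f)
		return -errno;
	if (fputs(value, f) == EOF)
		err = -EIO;
	/* buffered writes may only fail when the stream is flushed on close */
	if (fclose(f) == EOF && !err)
		err = -EIO;
	return err;
}

/* Hypothetical helper: enable (on != 0) or disable (on == 0) threaded
 * NAPI for one interface.
 * Assumption: the attribute path is /sys/class/net/<iface>/threaded. */
int napi_set_threaded(const char *iface, int on)
{
	char path[128];
	int n;

	assert(iface);
	n = snprintf(path, sizeof(path), "/sys/class/net/%s/threaded", iface);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -ENAMETOOLONG;
	return write_attr(path, on ? "1" : "0");
}
```

A caller would use something like `napi_set_threaded("eth0", 1)` and check for a negative return value; the interface name is only an example.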

On 2022-03-28 07:39:25 [-1000], Tejun Heo wrote:
> Hello,

Hi,

> I wonder whether it'd be useful to provide a set of wrappers which can make
> switching between workqueue and kworker easy. Semantics-wise, they're
> already mostly aligned and it shouldn't be too difficult to e.g. make an
> unbounded workqueue be backed by a dedicated kthread_worker instead of
> shared pool depending on a flag, or even allow switching dynamically.

This could work. For the tty layer it could use 'lowlatency' attribute
to decide which implementation makes sense.

> Thanks.
>

Sebastian

On 29/03/2022 08.30, Sebastian Andrzej Siewior wrote:
> On 2022-03-28 07:39:25 [-1000], Tejun Heo wrote:
>> Hello,
> Hi,
>
>> I wonder whether it'd be useful to provide a set of wrappers which can make
>> switching between workqueue and kworker easy. Semantics-wise, they're
>> already mostly aligned and it shouldn't be too difficult to e.g. make an
>> unbounded workqueue be backed by a dedicated kthread_worker instead of
>> shared pool depending on a flag, or even allow switching dynamically.

Well, that would certainly not make it any easier for userspace to
discover the thread it needs to chrt().

> This could work. For the tty layer it could use 'lowlatency' attribute
> to decide which implementation makes sense.

I have patches that merely touch the tty layer, but tying it to the
lowlatency attribute is quite painful (which has also come up in
previous discussions on this) - because the lowlatency flag can be
flipped from userspace, but synchronizing which variant is used and
switching dynamically is at least beyond my skills to make work
robustly. So in my patches, the choice is made at open() time. However,
I'm still not convinced code like

 struct tty_bufhead {
 	struct tty_buffer *head;	/* Queue head */
 	struct work_struct work;
+	struct kthread_work kwork;
+	struct kthread_worker *kworker;


 bool tty_buffer_restart_work(struct tty_port *port)
 {
-	return queue_work(system_unbound_wq, &port->buf.work);
+	struct tty_bufhead *buf = &port->buf;
+
+	if (buf->kworker)
+		return kthread_queue_work(buf->kworker, &buf->kwork);
+	else
+		return queue_work(system_unbound_wq, &buf->work);
 }

etc. is the way to go.

===

Here's another idea: In an ideal world, the irq thread itself [people
caring about latency use threaded interrupts] could just do the work
immediately - then the admin only has one kernel thread to properly
configure. However, as Sebastian pointed out, doing that leads to a
lockdep splat [1], and it also means that there's no work item involved,
so some other thread calling tty_buffer_flush_work() might not actually
wait for a concurrent flush_to_ldisc() to finish. So could we create a
struct hybrid_work { } which, when enqueued, does something like

bool current_is_irqthread(void) { return in_task() &&
kthread_func(current) == irq_thread; }

hwork_queue(struct hybrid_work *hwork, struct workqueue_struct *wq)
  if (current_is_irqthread()) {
     task_work_add(current, &hwork->twork)
  } else {
     queue_work(wq, &hwork->work);
  }

(with extra bookkeeping so _flush and _cancel_sync methods can also be
created). It would require irqthread to learn to run its queued
task_works in its main loop, which in turn would require finding some
other way to do the irq_thread_dtor() cleanup, but that should be doable.

While the implementation of hybrid_work might be a bit complex, I think
this would have potential for being used in other situations, and for
the users, the API would be as simple as the current workqueue/struct
kwork APIs. By letting the irq thread do more/all of the work, we'd
probably also win some latency due to fewer threads involved and better
cache locality. And the admin/BSP is already setting the rt priorities
of the [irq/...] threads.

Rasmus

[1]
https://lore.kernel.org/linux-rt-users/20180711080957.f6txdmzrrrrdm7ig@linutronix.de/
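
The dispatch rule proposed in this message can be modelled in plain userspace C to make the intended control flow concrete. This is only a toy model under stated assumptions: the boolean `in_irq_thread` stands in for `current_is_irqthread()`, the two lists stand in for the irq thread's private pending list and for the shared workqueue, and none of the flush/cancel bookkeeping is modelled. Every name other than `hybrid_work` and `hwork_queue` is invented for the illustration, and `hwork_queue` here drops the workqueue argument for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct hybrid_work {
	void (*func)(struct hybrid_work *hwork);
	struct hybrid_work *next;
	bool pending;
};

/* Simple FIFO; a zero-initialized list is empty. */
struct hwork_list {
	struct hybrid_work *head, *tail;
};

struct hwork_list irq_thread_list;	/* models work kept by the irq thread */
struct hwork_list shared_wq;		/* models the shared workqueue */
bool in_irq_thread;			/* stand-in for current_is_irqthread() */

/* Queue on the caller's own list when "in" the irq thread, else on the
 * shared queue. Returns false if the item was already pending. */
bool hwork_queue(struct hybrid_work *hwork)
{
	struct hwork_list *l = in_irq_thread ? &irq_thread_list : &shared_wq;

	assert(hwork && hwork->func);
	if (hwork->pending)
		return false;
	hwork->pending = true;
	hwork->next = NULL;
	if (l->tail)
		l->tail->next = hwork;
	else
		l->head = hwork;
	l->tail = hwork;
	return true;
}

/* Drain one list, as the irq thread's main loop or a worker would.
 * Returns the number of items executed. */
int hwork_run(struct hwork_list *l)
{
	int n = 0;

	while (l->head) {
		struct hybrid_work *w = l->head;

		l->head = w->next;
		if (!l->head)
			l->tail = NULL;
		w->pending = false;	/* cleared first so func may requeue */
		w->func(w);
		n++;
	}
	return n;
}

/* Example callback for the model: counts invocations. */
int demo_runs;
void demo_func(struct hybrid_work *hwork)
{
	(void)hwork;
	demo_runs++;
}
```

In this model, an item queued while `in_irq_thread` is true is only ever executed by `hwork_run(&irq_thread_list)`, which is the property that would let the admin tune a single kernel thread.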

On 2022-03-29 10:33:19 [+0200], Rasmus Villemoes wrote:
> On 29/03/2022 08.30, Sebastian Andrzej Siewior wrote:
> > On 2022-03-28 07:39:25 [-1000], Tejun Heo wrote:
> >> Hello,
> > Hi,
> >
> >> I wonder whether it'd be useful to provide a set of wrappers which can make
> >> switching between workqueue and kworker easy. Semantics-wise, they're
> >> already mostly aligned and it shouldn't be too difficult to e.g. make an
> >> unbounded workqueue be backed by a dedicated kthread_worker instead of
> >> shared pool depending on a flag, or even allow switching dynamically.
>
> Well, that would certainly not make it any easier for userspace to
> discover the thread it needs to chrt().

It should be configured within the tty-layer and not making a working RT
just because it is possible.

> > This could work. For the tty layer it could use 'lowlatency' attribute
> > to decide which implementation makes sense.
>
> I have patches that merely touch the tty layer, but tying it to the
> lowlatency attribute is quite painful (which has also come up in
> previous discussions on this) - because the lowlatency flag can be
> flipped from userspace, but synchronizing which variant is used and
> switching dynamically is at least beyond my skills to make work
> robustly. So in my patches, the choice is made at open() time. However,
> I'm still not convinced code like
>
>  struct tty_bufhead {
>  	struct tty_buffer *head;	/* Queue head */
>  	struct work_struct work;
> +	struct kthread_work kwork;
> +	struct kthread_worker *kworker;
>
>
>  bool tty_buffer_restart_work(struct tty_port *port)
>  {
> -	return queue_work(system_unbound_wq, &port->buf.work);
> +	struct tty_bufhead *buf = &port->buf;
> +
> +	if (buf->kworker)
> +		return kthread_queue_work(buf->kworker, &buf->kwork);
> +	else
> +		return queue_work(system_unbound_wq, &buf->work);
>  }
>
> etc. is the way to go.
>
> ===
>
> Here's another idea: In an ideal world, the irq thread itself [people
> caring about latency use threaded interrupts] could just do the work
> immediately - then the admin only has one kernel thread to properly
> configure. However, as Sebastian pointed out, doing that leads to a
> lockdep splat [1], and it also means that there's no work item involved,
> so some other thread calling tty_buffer_flush_work() might not actually
> wait for a concurrent flush_to_ldisc() to finish. So could we create a
> struct hybrid_work { } which, when enqueued, does something like
>
> bool current_is_irqthread(void) { return in_task() &&
> kthread_func(current) == irq_thread; }
>
> hwork_queue(struct hybrid_work *hwork, struct workqueue_struct *wq)
>   if (current_is_irqthread()) {
>      task_work_add(current, &hwork->twork)
>   } else {
>      queue_work(wq, &hwork->work);
>   }
>
> (with extra bookkeeping so _flush and _cancel_sync methods can also be
> created). It would require irqthread to learn to run its queued
> task_works in its main loop, which in turn would require finding some
> other way to do the irq_thread_dtor() cleanup, but that should be doable.
>
> While the implementation of hybrid_work might be a bit complex, I think
> this would have potential for being used in other situations, and for
> the users, the API would be as simple as the current workqueue/struct
> kwork APIs. By letting the irq thread do more/all of the work, we'd
> probably also win some latency due to fewer threads involved and better
> cache locality. And the admin/BSP is already setting the rt priorities
> of the [irq/...] threads.

Hmmm. Sounds complicated. Especially the part where irqthread needs to
deal with irq_thread_dtor in another way.
If this is something we want for everyone and not just for the "low
latency" attribute because it seems to make sense for everyone, would it
work to add the data in one step and then flush it once all locks are
dropped? The UART driver could be extended to a threaded handler if it
is not desired/ possible to complete in the primary handler.

> Rasmus

Sebastian

On 01/04/2022 11.21, Sebastian Andrzej Siewior wrote:
> On 2022-03-29 10:33:19 [+0200], Rasmus Villemoes wrote:
>> On 29/03/2022 08.30, Sebastian Andrzej Siewior wrote:
>>> On 2022-03-28 07:39:25 [-1000], Tejun Heo wrote:
>>>> Hello,
>>> Hi,
>>>
>>>> I wonder whether it'd be useful to provide a set of wrappers which can make
>>>> switching between workqueue and kworker easy. Semantics-wise, they're
>>>> already mostly aligned and it shouldn't be too difficult to e.g. make an
>>>> unbounded workqueue be backed by a dedicated kthread_worker instead of
>>>> shared pool depending on a flag, or even allow switching dynamically.
>>
>> Well, that would certainly not make it any easier for userspace to
>> discover the thread it needs to chrt().
>
> It should be configured within the tty-layer and not making a working RT
> just because it is possible.

I'm sorry, I can't parse that sentence.

The tty-layer cannot possibly set the right RT priorities, only the
application/userspace/the BSP developer knows what is right. The kernel
has rightly standardized on just the two sched_set_fifo and
sched_set_fifo_low; the admin must configure the system, but that also
requires that the admin has access to knobs to actually do that.

>>
>> Here's another idea: In an ideal world, the irq thread itself [people
>> caring about latency use threaded interrupts] could just do the work
>> immediately - then the admin only has one kernel thread to properly
>> configure. However, as Sebastian pointed out, doing that leads to a
>> lockdep splat [1], and it also means that there's no work item involved,
>> so some other thread calling tty_buffer_flush_work() might not actually
>> wait for a concurrent flush_to_ldisc() to finish. So could we create a
>> struct hybrid_work { } which, when enqueued, does something like
>>
>> bool current_is_irqthread(void) { return in_task() &&
>> kthread_func(current) == irq_thread; }
>>
>> hwork_queue(struct hybrid_work *hwork, struct workqueue_struct *wq)
>>   if (current_is_irqthread()) {
>>      task_work_add(current, &hwork->twork)
>>   } else {
>>      queue_work(wq, &hwork->work);
>>   }
>>
>> (with extra bookkeeping so _flush and _cancel_sync methods can also be
>> created). It would require irqthread to learn to run its queued
>> task_works in its main loop, which in turn would require finding some
>> other way to do the irq_thread_dtor() cleanup, but that should be doable.
>>
>> While the implementation of hybrid_work might be a bit complex, I think
>> this would have potential for being used in other situations, and for
>> the users, the API would be as simple as the current workqueue/struct
>> kwork APIs. By letting the irq thread do more/all of the work, we'd
>> probably also win some latency due to fewer threads involved and better
>> cache locality. And the admin/BSP is already setting the rt priorities
>> of the [irq/...] threads.
>
> Hmmm. Sounds complicated. Especially the part where irqthread needs to
> deal with irq_thread_dtor in another way.

Well, we wouldn't need to use the task_work mechanism, we could also add
a list_head to struct irqaction {} aka the irq thread's kthread_data().

> If this is something we want for everyone and not just for the "low
> latency" attribute because it seems to make sense for everyone, would it
> work to add the data in one step and then flush it once all locks are
> dropped? The UART driver could be extended to a threaded handler if it
> is not desired/ possible to complete in the primary handler.

Yes, the idea is certainly to create something which is applicable more
generally than just for the tty problem. There are lots of places where
one ends up with a somewhat silly situation in that the driver's irq
handler is carefully written to not do much more than just schedule a
work item, so with the -RT patch set, we wake a task so it can wake a
task so it can ... And it also means that the admin might have carefully
adjusted the rt priority of the irq/foobar kernel thread and the
consuming application, but that doesn't matter when there's some random
SCHED_OTHER task in between - i.e. exactly the tty problem.

I guess I should write some real patches to explain what I mean more
clearly.

Rasmus
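
As a concrete picture of the admin-side tuning discussed in this thread, the sketch below shows roughly what `chrt -f -p <prio> <pid>` does for a thread such as `irq/foobar` once its PID is known. It is a minimal Linux userspace example under stated assumptions: `set_fifo_priority` is a name made up here, discovering the PID (for example by scanning `/proc`) is left out, and the call needs CAP_SYS_NICE or root to succeed.

```c
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: switch the thread identified by pid to SCHED_FIFO
 * at the given priority. pid 0 means the calling thread.
 * Returns 0 on success or a negative errno value. */
int set_fifo_priority(pid_t pid, int prio)
{
	struct sched_param sp;

	assert(pid >= 0);
	/* Reject priorities outside the range the system reports for
	 * SCHED_FIFO before bothering the kernel. */
	if (prio < sched_get_priority_min(SCHED_FIFO) ||
	    prio > sched_get_priority_max(SCHED_FIFO))
		return -EINVAL;

	memset(&sp, 0, sizeof(sp));
	sp.sched_priority = prio;
	if (sched_setscheduler(pid, SCHED_FIFO, &sp) == -1)
		return -errno;
	return 0;
}
```

The point made in the thread still applies to this sketch: raising the priority of the irq thread and of the consuming application does not help if an intermediate worker stays at normal priority and cannot be targeted this way.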