Message ID | 24a844c3-c2e0-c735-ccb7-83736218b548@gmail.com |
---|---|
State | New |
Headers | show |
Series | mmc: host: meson-gx-mmc: fix possible deadlock condition for preempt_rt | expand |
Hi Brad, Brad Harper <bjharper@gmail.com> writes: > Force threaded interrupts for meson_mmc_irq to prevent possible deadlock > condition > during mmc operations when using preempt_rt with 5.9.0-rc3-rt3 patches > on arm64. > > Using meson-gx-mmc with an emmc device on Hardkernel Odroid N2+ > configured with > preempt_rt resulted in the soc becoming unresponsive. With lock > checking enabled > the below inconsistent lock state was observed during boot. > > After some discussions with tglx in IRC #linux-rt the attached patch was > suggested > to remove IRQF_ONESHOT from request_threaded_irq. > This has been tested and confirmed by me to resolve both the > unresponsive soc and > the inconsistent lock state warning when using 5.9.0-rc3-rt3 on arm64 > Odroid N2+. > > Further review and testing is required to ensure there are no adverse > impacts or > concerns and that is the correct method to resolve the problem. I will > continue > to test on various amlogic devices with both standard mainline low > latency kernel > and preempt_rt kernel with -rt patches. This looks right to me, thanks for sending a fix. For broader testing, I can add this to my testing branch so it gets booted on a bunch more platform in KernelCI also. However... [...] > Signed-off-by: Brad Harper <bjharper@gmail.com> > --- > drivers/mmc/host/meson-gx-mmc.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/mmc/host/meson-gx-mmc.c > b/drivers/mmc/host/meson-gx-mmc.c > index 08a3b1c05..130ac134d 100644 > --- a/drivers/mmc/host/meson-gx-mmc.c > +++ b/drivers/mmc/host/meson-gx-mmc.c > @@ -1139,7 +1139,7 @@ static int meson_mmc_probe(struct platform_device > *pdev) > host->regs + SD_EMMC_IRQ_EN); > > ret = request_threaded_irq(host->irq, meson_mmc_irq, > - meson_mmc_irq_thread, IRQF_ONESHOT, > + meson_mmc_irq_thread, 0, > dev_name(&pdev->dev), host); > if (ret) > goto err_init_clk; This patch has been mangled by your mailer, so it doesn't apply cleanly. If you're using the gmail web UI, this is a common problem. I strongly recommend using git-send-email to send directly via gmail SMTP. The git-send-email docs[1] give some examples on how to set this up. Kevin [1] https://git-scm.com/docs/git-send-email#_examples
On Thu 24 Sep 2020 at 19:01, Kevin Hilman <khilman@baylibre.com> wrote: > Hi Brad, > > Brad Harper <bjharper@gmail.com> writes: > >> Force threaded interrupts for meson_mmc_irq to prevent possible deadlock >> condition >> during mmc operations when using preempt_rt with 5.9.0-rc3-rt3 patches >> on arm64. >> >> Using meson-gx-mmc with an emmc device on Hardkernel Odroid N2+ >> configured with >> preempt_rt resulted in the soc becoming unresponsive. With lock >> checking enabled >> the below inconsistent lock state was observed during boot. >> >> After some discussions with tglx in IRC #linux-rt the attached patch was >> suggested >> to remove IRQF_ONESHOT from request_threaded_irq. >> This has been tested and confirmed by me to resolve both the >> unresponsive soc and >> the inconsistent lock state warning when using 5.9.0-rc3-rt3 on arm64 >> Odroid N2+. >> >> Further review and testing is required to ensure there are no adverse >> impacts or >> concerns and that is the correct method to resolve the problem. I will >> continue >> to test on various amlogic devices with both standard mainline low >> latency kernel >> and preempt_rt kernel with -rt patches. > > This looks right to me, thanks for sending a fix. > > For broader testing, I can add this to my testing branch so it gets > booted on a bunch more platform in KernelCI also. > > However... > > [...] > >> Signed-off-by: Brad Harper <bjharper@gmail.com> >> --- >> drivers/mmc/host/meson-gx-mmc.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/drivers/mmc/host/meson-gx-mmc.c >> b/drivers/mmc/host/meson-gx-mmc.c >> index 08a3b1c05..130ac134d 100644 >> --- a/drivers/mmc/host/meson-gx-mmc.c >> +++ b/drivers/mmc/host/meson-gx-mmc.c >> @@ -1139,7 +1139,7 @@ static int meson_mmc_probe(struct platform_device >> *pdev) >> host->regs + SD_EMMC_IRQ_EN); >> >> ret = request_threaded_irq(host->irq, meson_mmc_irq, >> - meson_mmc_irq_thread, IRQF_ONESHOT, >> + meson_mmc_irq_thread, 0, >> dev_name(&pdev->dev), host); I'm not sure about this. As you have explained on IRC, I understand that IRQF_ONESHOT is causing trouble with RT as the hard IRQ part of the thread will not be migrated to a thread. That was certainly not the intent when putting this flag. As described in include/linux/interrupt.h: * IRQF_ONESHOT - Interrupt is not reenabled after the hardirq handler finished. * Used by threaded interrupts which need to keep the * irq line disabled until the threaded handler has been run. The intent was only that, making sure the irq is not re-enabled until the threaded part is done. AFAIU, removing this flag would allow the hard irq handler to fire again while we are not done handling the previous IRQ in threaded part. This seems pretty unsafe to me. Maybe we could improve the driver so it copes with this case gracefully. ATM, I don't think it would. Maybe I missed something and I am happy to be enlightened if that is the case :) >> if (ret) >> goto err_init_clk; > > This patch has been mangled by your mailer, so it doesn't apply cleanly. > If you're using the gmail web UI, this is a common problem. > > I strongly recommend using git-send-email to send directly via gmail > SMTP. The git-send-email docs[1] give some examples on how to set this > up. > > Kevin > > [1] https://git-scm.com/docs/git-send-email#_examples > > _______________________________________________ > linux-amlogic mailing list > linux-amlogic@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-amlogic
On 2020-09-25 11:11:42 [+0200], Jerome Brunet wrote: > I'm not sure about this. > As you have explained on IRC, I understand that IRQF_ONESHOT is causing > trouble with RT as the hard IRQ part of the thread will not be migrated > to a thread. That was certainly not the intent when putting this flag. That is my understanding as well. > This seems pretty unsafe to me. Maybe we could improve the driver so it > copes with this case gracefully. ATM, I don't think it would. Running the primary handler in hardirq context is bad, because it invokes meson_mmc_request_done() at the very end. And here: - mmc_complete_cmd() -> complete_all() There is a lockdep_assert_RT_in_threaded_ctx() which should trigger. - led_trigger_event() -> led_trigger_event() This should trigger a might_sleep() warning somewhere. So removing IRQF_ONESHOT is okay but it should additionally disable the IRQ source in meson_mmc_irq() and re-enable back in meson_mmc_irq_thread(). Otherwise the IRQ remains asserted and may fire multiple times before the thread has a chance to run. Sebastian
On Fri 25 Sep 2020 at 15:44, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote: > On 2020-09-25 11:11:42 [+0200], Jerome Brunet wrote: >> I'm not sure about this. >> As you have explained on IRC, I understand that IRQF_ONESHOT is causing >> trouble with RT as the hard IRQ part of the thread will not be migrated >> to a thread. That was certainly not the intent when putting this flag. > > That is my understanding as well. > >> This seems pretty unsafe to me. Maybe we could improve the driver so it >> copes with this case gracefully. ATM, I don't think it would. > > Running the primary handler in hardirq context is bad, because it > invokes meson_mmc_request_done() at the very end. And here: > - mmc_complete_cmd() -> complete_all() > There is a lockdep_assert_RT_in_threaded_ctx() which should trigger. > > - led_trigger_event() -> led_trigger_event() > This should trigger a might_sleep() warning somewhere. > > So removing IRQF_ONESHOT is okay but it should additionally disable the > IRQ source in meson_mmc_irq() and re-enable back in > meson_mmc_irq_thread(). Otherwise the IRQ remains asserted and may fire > multiple times before the thread has a chance to run. Looks like we need to do manually what IRQF_ONESHOT was doing for us :( This brings a few questions: * The consideration you described is not mentioned near the description of IRQF_ONESHOT. Maybe it should so other drivers with same intent don't end up in the same pitfall ? * Why doesn't RT move the IRQ with this flag ? Seems completly unrelated to RT (maybe it is the same documentation problem) * Can't we have flag doing the irq disable in the same way while still allowing to RT to do its magic ? seems better than open coding it in the driver ? > > Sebastian
diff --git a/drivers/mmc/host/meson-gx-mmc.c b/drivers/mmc/host/meson-gx-mmc.c index 08a3b1c05..130ac134d 100644 --- a/drivers/mmc/host/meson-gx-mmc.c +++ b/drivers/mmc/host/meson-gx-mmc.c @@ -1139,7 +1139,7 @@ static int meson_mmc_probe(struct platform_device *pdev) host->regs + SD_EMMC_IRQ_EN); ret = request_threaded_irq(host->irq, meson_mmc_irq, - meson_mmc_irq_thread, IRQF_ONESHOT, + meson_mmc_irq_thread, 0, dev_name(&pdev->dev), host); if (ret)
Force threaded interrupts for meson_mmc_irq to prevent possible deadlock condition during mmc operations when using preempt_rt with 5.9.0-rc3-rt3 patches on arm64. Using meson-gx-mmc with an emmc device on Hardkernel Odroid N2+ configured with preempt_rt resulted in the soc becoming unresponsive. With lock checking enabled the below inconsistent lock state was observed during boot. After some discussions with tglx in IRC #linux-rt the attached patch was suggested to remove IRQF_ONESHOT from request_threaded_irq. This has been tested and confirmed by me to resolve both the unresponsive soc and the inconsistent lock state warning when using 5.9.0-rc3-rt3 on arm64 Odroid N2+. Further review and testing is required to ensure there are no adverse impacts or concerns and that is the correct method to resolve the problem. I will continue to test on various amlogic devices with both standard mainline low latency kernel and preempt_rt kernel with -rt patches. [ 7.858446] ================================ [ 7.858448] WARNING: inconsistent lock state [ 7.858450] 5.9.0-rc3-rt3+ #33 Not tainted [ 7.858453] -------------------------------- [ 7.858456] inconsistent {IN-HARDIRQ-R} -> {HARDIRQ-ON-W} usage. [ 7.858459] swapper/0/1 [HC0[0]:SC0[0]:HE1:SE1] takes: [ 7.858465] ffff80001219f4d8 (&trig->leddev_list_lock){+?.+}-{0:0}, at: led_trigger_set+0x104/0x270 [ 7.858482] {IN-HARDIRQ-R} state was registered at: [ 7.858484] lock_acquire+0xec/0x468 [ 7.858491] rt_read_lock+0xb0/0x108 [ 7.858497] led_trigger_event+0x34/0x88 [ 7.858501] mmc_request_done+0x3f0/0x450 [ 7.858505] meson_mmc_irq+0x284/0x378 [ 7.858511] __handle_irq_event_percpu+0xcc/0x4a8 [ 7.858515] handle_irq_event_percpu+0x60/0xb0 [ 7.858519] handle_irq_event+0x50/0x108 [ 7.858522] handle_fasteoi_irq+0xd0/0x180 [ 7.858527] generic_handle_irq+0x38/0x50 [ 7.858530] __handle_domain_irq+0x6c/0xc8 [ 7.858533] gic_handle_irq+0x5c/0xb8 [ 7.858537] el1_irq+0xbc/0x180 [ 7.858540] arch_cpu_idle+0x28/0x38 [ 7.858544] default_idle_call+0x90/0x3f0 [ 7.858547] do_idle+0x250/0x268 [ 7.858551] cpu_startup_entry+0x2c/0x78 [ 7.858554] rest_init+0x1b0/0x284 [ 7.858559] arch_call_rest_init+0x18/0x24 [ 7.858565] start_kernel+0x550/0x588 [ 7.858569] irq event stamp: 1925495 [ 7.858571] hardirqs last enabled at (1925495): [<ffff8000111e3ba4>] _raw_spin_unlock_irqrestore+0xa4/0xb0 [ 7.858576] hardirqs last disabled at (1925494): [<ffff8000111e3c58>] _raw_spin_lock_irqsave+0xa8/0xb8 [ 7.858580] softirqs last enabled at (1857856): [<ffff80001024705c>] bdi_register_va+0x114/0x368 [ 7.858586] softirqs last disabled at (1857849): [<ffff80001024705c>] bdi_register_va+0x114/0x368 [ 7.858590] other info that might help us debug this: [ 7.858592] Possible unsafe locking scenario: [ 7.858594] CPU0 [ 7.858595] ---- [ 7.858597] lock(&trig->leddev_list_lock); [ 7.858600] <Interrupt> [ 7.858602] lock(&trig->leddev_list_lock); [ 7.858604] *** DEADLOCK *** [ 7.858606] 3 locks held by swapper/0/1: [ 7.858609] #0: ffff80001219eb30 (leds_list_lock){++++}-{0:0}, at: led_trigger_register+0xf4/0x1c0 [ 7.858619] #1: ffff0000b0696a70 (&led_cdev->trigger_lock){+.+.}-{0:0}, at: led_trigger_register+0x134/0x1c0 [ 7.858629] #2: ffff800011fb83d0 (rcu_read_lock){....}-{1:2}, at: rt_write_lock+0x8/0x108 [ 7.858637] stack backtrace: [ 7.858640] CPU: 5 PID: 1 Comm: swapper/0 Not tainted 5.9.0-rc3-rt3+ #33 [ 7.858643] Hardware name: Hardkernel ODROID-N2Plus (DT) [ 7.858646] Call trace: [ 7.858647] dump_backtrace+0x0/0x1e8 [ 7.858650] show_stack+0x20/0x30 [ 7.858653] dump_stack+0xf0/0x164 [ 7.858659] print_usage_bug+0x2b4/0x2c0 [ 7.858662] mark_lock+0x2e8/0x360 [ 7.858665] __lock_acquire+0x238/0x1858 [ 7.858669] lock_acquire+0xec/0x468 [ 7.858672] rt_write_lock+0xb0/0x108 [ 7.858675] led_trigger_set+0x104/0x270 [ 7.858678] led_trigger_register+0x180/0x1c0 [ 7.858681] heartbeat_trig_init+0x28/0x5c [ 7.858686] do_one_initcall+0x90/0x4bc [ 7.858690] kernel_init_freeable+0x2cc/0x338 [ 7.858694] kernel_init+0x1c/0x11c [ 7.858697] ret_from_fork+0x10/0x34 Signed-off-by: Brad Harper <bjharper@gmail.com> --- drivers/mmc/host/meson-gx-mmc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) goto err_init_clk; -- 2.20.1