Message ID: 20220708162702.1758865-1-john@metanate.com
State:      New
Series:     [v3] sched/core: Always flush pending blk_plug
[PATCH v3] sched/core: Always flush pending blk_plug

On 08/07/2022 (Fri 17:27) John Keeping wrote:

> With CONFIG_PREEMPT_RT, it is possible to hit a deadlock between two
> normal priority tasks (SCHED_OTHER, nice level zero):
>
> INFO: task kworker/u8:0:8 blocked for more than 491 seconds.
> Not tainted 5.15.49-rt46 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:kworker/u8:0 state:D stack: 0 pid: 8 ppid: 2 flags:0x00000000
> Workqueue: writeback wb_workfn (flush-7:0)

This is great.  I was on-and-off looking at an ext4 deadlock triggered by
LTP's "ftest01" - and was just putting my notes together this morning to
clarify for myself what I thought was going on before I forgot.

Which is good, because I could then recognize how similar the patterns in
what you saw were to what I'd observed/deduced.  Otherwise I'd probably
not have had your backtraces grab my attention at all.

Long story short, I applied this to v5.19-rc3-rt5 and it resolves the
LTP/ext4 deadlock issue.  So...

Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>

As I'd already put together my ext4 debug info, I'll include it below in
case it helps future people searching for LTP and ftest01.

Thanks,
Paul.

-----------------

There is a reproducible deadlock in ext4 on preempt-rt, triggered by LTP's
"ftest01".  Reproducer details are at the bottom.

What seems to happen is as follows:

The kworker doing writeback wb_workfn (pid 213 below) stalls on the one
and only down_write() in ext4_map_blocks() and never moves again.

        /*
         * New blocks allocate and/or writing to unwritten extent
         * will possibly result in updating i_data, so we take
         * the write lock of i_data_sem, and call get_block()
         * with create == 1 flag.
         */
        down_write(&EXT4_I(inode)->i_data_sem);

So who is blocking the kworker?  There are two of the ftest01 processes
(pid 818 and pid 821) doing ext4 truncate, which look like this:

  ext4_truncate+0x46a/0x660
    --> ext4_ext_truncate+0xd6/0xf0
    --> ext4_ext_remove_space+0x59f/0x1830
    --> jbd2_journal_get_write_access+0x51/0x80
    --> do_get_write_access+0x2ad/0x550
    --> complete_all+0x60/0x60
    --> bit_wait_io+0x11/0x60

They too are stalled in D state, waiting there forever - presumably
waiting on the kworker to make the progress they need.

Here is where I think lockdep shows us the problem:

  5 locks held by ftest01/818:
  [...]
   #4: ffff957c2a679958 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660

  5 locks held by ftest01/821:
  [...]
   #4: ffff957c2a67bb78 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660

Yes - that is the one the kworker was trying the down_write() on.

Also note that the bit_wait is a big telltale sign here - there is only
one in jbd2's do_get_write_access() function:

        /*
         * There is one case we have to be very careful about.  If the
         * committing transaction is currently writing this buffer out to disk
         * and has NOT made a copy-out, then we cannot modify the buffer
         * contents at all right now.  The essence of copy-out is that it is
         * the extra copy, not the primary copy, which gets journaled.  If the
         * primary copy is already going to disk then we cannot do copy-out
         * here.
         */
        if (buffer_shadow(bh)) {
                JBUFFER_TRACE(jh, "on shadow: sleep");
                spin_unlock(&jh->b_state_lock);
                wait_on_bit_io(&bh->b_state, BH_Shadow, TASK_UNINTERRUPTIBLE);
                goto repeat;
        }

Presumably this is a corner case that preempt-rt makes easier to trigger?

The full trace with interspersed lockdep info follows (below a brief
request-plugging sketch).  Various detailed reproducer information
follows that.
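As background on the mechanism the patch under test changes - a minimal,
hypothetical sketch, not code from this thread or from ext4/jbd2 - block
I/O submitted inside a plugged section is batched on a per-task blk_plug
and only dispatched when the plug is finished, or when the task goes to
sleep and schedule() flushes the pending plug via sched_submit_work().
If a task sleeps without that flush happening, anyone waiting on the
batched I/O (for example a BH_Shadow waiter like the ones above) can wait
forever.  The names submit_batch and nr_bios below are made up for
illustration only:

        #include <linux/bio.h>
        #include <linux/blkdev.h>

        /* Illustrative only: batch a set of bios under a blk_plug. */
        static void submit_batch(struct bio **bios, int nr_bios)
        {
                struct blk_plug plug;
                int i;

                blk_start_plug(&plug);          /* start batching ("plugging") requests */
                for (i = 0; i < nr_bios; i++)
                        submit_bio(bios[i]);    /* queued behind the plug, not yet dispatched */
                blk_finish_plug(&plug);         /* dispatch the batch; the other path is the
                                                 * schedule()-time flush in sched_submit_work(),
                                                 * which this patch makes unconditional */
        }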
-----------------------

sysrq-w backtrace information interspersed with sysrq-d lockdep info on
5.19-rc2-rt3 kernel:

task:kworker/u129:34 state:D stack: 0 pid: 213 ppid: 2 flags:0x00004000
Workqueue: writeback wb_workfn (flush-7:0)
Call Trace:
 <TASK>
 __schedule+0x329/0xae0
 ? __this_cpu_preempt_check+0x13/0x20
 ? __rt_mutex_slowlock.isra.21.constprop.25+0xc0e/0xc40
 schedule+0x48/0xc0
 __rt_mutex_slowlock.isra.21.constprop.25+0x1b8/0xc40
 rwbase_write_lock+0x98/0x6e0
 down_write+0x46/0x50
 ? ext4_map_blocks+0x156/0x5c0
 ext4_map_blocks+0x156/0x5c0
 ext4_writepages+0xa1a/0x1290
 ? __this_cpu_preempt_check+0x13/0x20
 ? lockdep_hardirqs_on+0xcd/0x150
 do_writepages+0xd2/0x1a0
 __writeback_single_inode+0x64/0x850
 writeback_sb_inodes+0x22d/0x5b0
 __writeback_inodes_wb+0x67/0xe0
 wb_writeback+0x269/0x5f0
 ? get_nr_inodes+0x49/0x70
 wb_workfn+0x43d/0x780
 ? __this_cpu_preempt_check+0x13/0x20
 process_one_work+0x286/0x710
 worker_thread+0x3c/0x3d0
 ? process_one_work+0x710/0x710
 kthread+0x13b/0x150
 ? kthread_complete_and_exit+0x20/0x20
 ret_from_fork+0x1f/0x30
 </TASK>

6 locks held by kworker/u129:34/213:
 #0: ffff957c25e9ef28 ((wq_completion)writeback){+.+.}-{0:0}, at: process_one_work+0x201/0x710
 #1: ffffb7e9c3797e68 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: process_one_work+0x201/0x710
 #2: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: trylock_super+0x1b/0x50
 #3: ffff957b8a9b2d18 (&sbi->s_writepages_rwsem){.+.+}-{0:0}, at: do_writepages+0xd2/0x1a0
 #4: ffff957b9c9b5be8 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0x13e/0x5f0
 #5: ffff957c2a679958 (&ei->i_data_sem){++++}-{3:3}, at: ext4_map_blocks+0x156/0x5c0

task:jbd2/loop0-8 state:D stack: 0 pid: 679 ppid: 2 flags:0x00004000
Call Trace:
 <TASK>
 __schedule+0x329/0xae0
 schedule+0x48/0xc0
 jbd2_journal_wait_updates+0x7c/0xf0
 ? schedstat_stop+0x10/0x10
 jbd2_journal_commit_transaction+0x284/0x20e0
 ? __this_cpu_preempt_check+0x13/0x20
 ? find_held_lock+0x35/0xa0
 ? _raw_spin_unlock_irqrestore+0x86/0x8f
 ? _raw_spin_unlock_irqrestore+0x86/0x8f
 ? try_to_del_timer_sync+0x53/0x80
 ? trace_hardirqs_on+0x3b/0x120
 ? try_to_del_timer_sync+0x53/0x80
 kjournald2+0xd5/0x2b0
 ? schedstat_stop+0x10/0x10
 ? commit_timeout+0x20/0x20
 kthread+0x13b/0x150
 ? kthread_complete_and_exit+0x20/0x20
 ret_from_fork+0x1f/0x30
 </TASK>

task:ftest01 state:D stack: 0 pid: 817 ppid: 816 flags:0x00004002
Call Trace:
 <TASK>
 __schedule+0x329/0xae0
 ? __x64_sys_tee+0xd0/0xd0
 ? __x64_sys_tee+0xd0/0xd0
 schedule+0x48/0xc0
 wb_wait_for_completion+0x62/0x90
 ? schedstat_stop+0x10/0x10
 sync_inodes_sb+0xdd/0x460
 ? __this_cpu_preempt_check+0x13/0x20
 ? __x64_sys_tee+0xd0/0xd0
 sync_inodes_one_sb+0x15/0x20
 iterate_supers+0x94/0x100
 ksys_sync+0x42/0xa0
 __do_sys_sync+0xe/0x20
 do_syscall_64+0x3f/0x90
 entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7f3f9dae2cdb
RSP: 002b:00007ffe06430318 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f3f9dae2cdb
RDX: 0000000000000040 RSI: 0000561f57de52f0 RDI: 0000561f57de5340
RBP: 00000000000ff000 R08: 0000000000000000 R09: 00007f3f9dbce0a0
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000207 R14: 0000000000000049 R15: 0000000000000003
 </TASK>

1 lock held by ftest01/817:
 #0: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: iterate_supers+0x7d/0x100

task:ftest01 state:D stack: 0 pid: 818 ppid: 816 flags:0x00004002
Call Trace:
 <TASK>
 __schedule+0x329/0xae0
 ? bit_wait+0x60/0x60
 schedule+0x48/0xc0
 io_schedule+0x16/0x40
 bit_wait_io+0x11/0x60
 __wait_on_bit_lock+0x5e/0xb0
 out_of_line_wait_on_bit_lock+0x93/0xb0
 ? complete_all+0x60/0x60
 __lock_buffer+0x3f/0x50
 do_get_write_access+0x2ad/0x550
 jbd2_journal_get_write_access+0x51/0x80
 __ext4_journal_get_write_access+0xf9/0x1a0
 ext4_ext_get_access.isra.32+0x34/0x50
 ext4_ext_remove_space+0x59f/0x1830
 ? ext4_ext_truncate+0xa2/0xf0
 ? ext4_ext_truncate+0x5e/0xf0
 ext4_ext_truncate+0xd6/0xf0
 ext4_truncate+0x46a/0x660
 ext4_setattr+0x6db/0xa40
 notify_change+0x3e5/0x540
 do_truncate+0x7b/0xc0
 ? do_truncate+0x7b/0xc0
 do_sys_ftruncate+0x164/0x2f0
 __x64_sys_ftruncate+0x1b/0x20
 do_syscall_64+0x3f/0x90
 entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7f3f9dae3f7b
RSP: 002b:00007ffe06430318 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
RAX: ffffffffffffffda RBX: 00000000000001a5 RCX: 00007f3f9dae3f7b
RDX: 00000000000001a5 RSI: 00000000000d2800 RDI: 0000000000000003
RBP: 00000000000001a5 R08: 00007f3f9dbce090 R09: 00007f3f9dbce0a0
R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000000
R13: 0000000000000207 R14: 0000000000000043 R15: 0000000000000003
 </TASK>

5 locks held by ftest01/818:
 #0: ffff957b9c9b74e8 (sb_writers#3){.+.+}-{0:0}, at: __x64_sys_ftruncate+0x1b/0x20
 #1: ffff957c2a679b00 (&sb->s_type->i_mutex_key#10){+.+.}-{3:3}, at: do_truncate+0x6a/0xc0
 #2: ffff957c2a679cc0 (mapping.invalidate_lock){++++}-{3:3}, at: ext4_setattr+0x39a/0xa40
 #3: ffff957b9c9b5be8 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0x13e/0x5f0
 #4: ffff957c2a679958 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660

task:ftest01 state:D stack: 0 pid: 819 ppid: 816 flags:0x00004002
Call Trace:
 <TASK>
 __schedule+0x329/0xae0
 ? __x64_sys_tee+0xd0/0xd0
 ? __x64_sys_tee+0xd0/0xd0
 schedule+0x48/0xc0
 wb_wait_for_completion+0x62/0x90
 ? schedstat_stop+0x10/0x10
 sync_inodes_sb+0xdd/0x460
 ? __this_cpu_preempt_check+0x13/0x20
 ? __x64_sys_tee+0xd0/0xd0
 sync_inodes_one_sb+0x15/0x20
 iterate_supers+0x94/0x100
 ksys_sync+0x42/0xa0
 __do_sys_sync+0xe/0x20
 do_syscall_64+0x3f/0x90
 entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7f3f9dae2cdb
RSP: 002b:00007ffe06430318 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f3f9dae2cdb
RDX: 0000000000000040 RSI: 0000561f57de52f0 RDI: 0000561f57de5340
RBP: 00000000000c2800 R08: 00000000fffff800 R09: 0000000000004000
R10: 0000000000000010 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000207 R14: 0000000000000075 R15: 0000000000000003
 </TASK>

1 lock held by ftest01/819:
 #0: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: iterate_supers+0x7d/0x100

task:ftest01 state:D stack: 0 pid: 820 ppid: 816 flags:0x00004002
Call Trace:
 <TASK>
 __schedule+0x329/0xae0
 ? __x64_sys_tee+0xd0/0xd0
 ? __x64_sys_tee+0xd0/0xd0
 schedule+0x48/0xc0
 wb_wait_for_completion+0x62/0x90
 ? schedstat_stop+0x10/0x10
 sync_inodes_sb+0xdd/0x460
 ? __this_cpu_preempt_check+0x13/0x20
 ? __x64_sys_tee+0xd0/0xd0
 sync_inodes_one_sb+0x15/0x20
 iterate_supers+0x94/0x100
 ksys_sync+0x42/0xa0
 __do_sys_sync+0xe/0x20
 do_syscall_64+0x3f/0x90
 entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7f3f9dae2cdb
RSP: 002b:00007ffe06430318 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f3f9dae2cdb
RDX: 0000000000000040 RSI: 0000561f57de52f0 RDI: 0000561f57de5340
RBP: 000000000006a800 R08: 00000000fffff800 R09: 0000000000004000
R10: 0000000000000010 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000207 R14: 0000000000000069 R15: 0000000000000003
 </TASK>

1 lock held by ftest01/820:
 #0: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: iterate_supers+0x7d/0x100

task:ftest01 state:D stack: 0 pid: 821 ppid: 816 flags:0x00000002
Call Trace:
 <TASK>
 __schedule+0x329/0xae0
 ? bit_wait+0x60/0x60
 schedule+0x48/0xc0
 io_schedule+0x16/0x40
 bit_wait_io+0x11/0x60
 __wait_on_bit_lock+0x5e/0xb0
 out_of_line_wait_on_bit_lock+0x93/0xb0
 ? complete_all+0x60/0x60
 __lock_buffer+0x3f/0x50
 do_get_write_access+0x2ad/0x550
 jbd2_journal_get_write_access+0x51/0x80
 __ext4_journal_get_write_access+0xf9/0x1a0
 ext4_ext_get_access.isra.32+0x34/0x50
 ext4_ext_remove_space+0x59f/0x1830
 ? ext4_ext_truncate+0xa2/0xf0
 ? ext4_ext_truncate+0x5e/0xf0
 ext4_ext_truncate+0xd6/0xf0
 ext4_truncate+0x46a/0x660
 ext4_setattr+0x6db/0xa40
 notify_change+0x3e5/0x540
 do_truncate+0x7b/0xc0
 ? do_truncate+0x7b/0xc0
 do_sys_ftruncate+0x164/0x2f0
 __x64_sys_ftruncate+0x1b/0x20
 do_syscall_64+0x3f/0x90
 entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7f3f9dae3f7b
RSP: 002b:00007ffe06430318 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
RAX: ffffffffffffffda RBX: 000000000000016a RCX: 00007f3f9dae3f7b
RDX: 000000000000016a RSI: 00000000000b5000 RDI: 0000000000000003
RBP: 000000000000016a R08: 00007f3f9dbce080 R09: 00007f3f9dbce0a0
R10: 0000000000000002 R11: 0000000000000202 R12: 0000000000000000
R13: 0000000000000207 R14: 0000000000000040 R15: 0000000000000003
 </TASK>

5 locks held by ftest01/821:
 #0: ffff957b9c9b74e8 (sb_writers#3){.+.+}-{0:0}, at: __x64_sys_ftruncate+0x1b/0x20
 #1: ffff957c2a67bd20 (&sb->s_type->i_mutex_key#10){+.+.}-{3:3}, at: do_truncate+0x6a/0xc0
 #2: ffff957c2a67bee0 (mapping.invalidate_lock){++++}-{3:3}, at: ext4_setattr+0x39a/0xa40
 #3: ffff957b9c9b5be8 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0x13e/0x5f0
 #4: ffff957c2a67bb78 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660

--------------------------------------

Reproducer:

Get latest preempt-rt kernel:
  git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git
  checkout v5.19-rc3-rt5 (or linux-5.19.y-rt branch)

make defconfig
manually enable CONFIG_EXPERT and then CONFIG_PREEMPT_RT
make oldconfig and take defaults for other options hidden behind EXPERT
can optionally enable LOCKDEP options so sysrq-d works.

Get LTP; build + install:
  https://github.com/linux-test-project/ltp.git

Boot and then run reproducer:

  dd if=/dev/zero of=/root/tmp_mount_file bs=1M count=512
  mkfs.ext4 -F /root/tmp_mount_file
  mount /root/tmp_mount_file /mnt/
  /opt/ltp/runltp -f fs -s ftest01 -d /mnt/
  umount /mnt

The loop device isn't strictly necessary, but it contains the fs fallout
when the inevitable reboot is required to get rid of the D state processes.

Test completes on v5.10-rt, v5.12-rt - should take about 2-3m total.
Test hangs forever on v5.15-rt and latest v5.19-rc-rt.
Test threads are in D state - any other i/o directed at /mnt also hangs in D state.
On 2022-07-08 17:27:02 [+0100], John Keeping wrote:
> Looking at the history of the -rt patchset, in fact this change was
> present from v5.9.1-rt20 until being dropped in v5.13-rt1 as it was part
> of a larger patch [1] most of which was replaced by commit b4bfa3fcfe3b
> ("sched/core: Rework the __schedule() preempt argument").

Thanks for spotting that. The plan was to get rid of this with the
introduction of schedule_rtlock() but it somehow remained resilient.

Sebastian
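For context on the schedule_rtlock() remark - paraphrased from v5.19-era
kernel/sched/core.c, not text from this thread, and details may differ
between versions - sleeping on a spinlock_t/rwlock_t under PREEMPT_RT goes
through schedule_rtlock(), which never calls sched_submit_work(), while
every other sleep goes through schedule(), which does.  That is why the old
tsk_is_pi_blocked() early return could be dropped and replaced by a warning:

        /* Roughly, in v5.19-era kernel/sched/core.c (paraphrased): */
        asmlinkage __visible void __sched schedule(void)
        {
                struct task_struct *tsk = current;

                sched_submit_work(tsk);         /* flushes plugged I/O, notifies workqueues */
                do {
                        preempt_disable();
                        __schedule(SM_NONE);
                        sched_preempt_enable_no_resched();
                } while (need_resched());
                sched_update_worker(tsk);
        }

        #ifdef CONFIG_PREEMPT_RT
        void __sched notrace schedule_rtlock(void)      /* spinlock_t/rwlock_t sleeps on RT */
        {
                do {
                        preempt_disable();
                        __schedule(SM_RTLOCK_WAIT);     /* no sched_submit_work() on this path */
                        sched_preempt_enable_no_resched();
                } while (need_resched());
        }
        #endif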
diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
index e5af028c08b49..994c25640e156 100644
--- a/include/linux/sched/rt.h
+++ b/include/linux/sched/rt.h
@@ -39,20 +39,12 @@ static inline struct task_struct *rt_mutex_get_top_task(struct task_struct *p)
 }
 extern void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task);
 extern void rt_mutex_adjust_pi(struct task_struct *p);
-static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
-{
-	return tsk->pi_blocked_on != NULL;
-}
 #else
 static inline struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
 {
 	return NULL;
 }
 # define rt_mutex_adjust_pi(p) do { } while (0)
-static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
-{
-	return false;
-}
 #endif
 
 extern void normalize_rt_tasks(void);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1d4660a1915b3..71d6385ece83f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6578,8 +6578,12 @@ static inline void sched_submit_work(struct task_struct *tsk)
 		io_wq_worker_sleeping(tsk);
 	}
 
-	if (tsk_is_pi_blocked(tsk))
-		return;
+	/*
+	 * spinlock and rwlock must not flush block requests. This will
+	 * deadlock if the callback attempts to acquire a lock which is
+	 * already acquired.
+	 */
+	SCHED_WARN_ON(current->__state & TASK_RTLOCK_WAIT);
 
 	/*
 	 * If we are going to sleep and we have plugged IO queued,
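With the hunk above applied, the tail of sched_submit_work() is reached for
every sleeping task, including rt_mutex waiters that the removed
tsk_is_pi_blocked() check used to skip.  Roughly - again paraphrased from
v5.19-era kernel/sched/core.c, exact helper names vary between versions -
the function ends up as:

        static inline void sched_submit_work(struct task_struct *tsk)
        {
                unsigned int task_flags;

                if (task_is_running(tsk))
                        return;

                task_flags = tsk->flags;
                /* Let the workqueue core wake a replacement worker if needed. */
                if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
                        if (task_flags & PF_WQ_WORKER)
                                wq_worker_sleeping(tsk);
                        else
                                io_wq_worker_sleeping(tsk);
                }

                /* Catch spinlock_t/rwlock_t sleepers, which must never get here. */
                SCHED_WARN_ON(current->__state & TASK_RTLOCK_WAIT);

                /*
                 * Going to sleep with plugged I/O still queued: submit it now
                 * so that a task waiting on that I/O cannot deadlock.
                 * Previously this was skipped whenever the task was blocked
                 * on an rt_mutex.
                 */
                blk_flush_plug(tsk->plug, true);
        }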