diff mbox series

[4.4,4.9,v1,2/2] mm: slub: allocate_slab() enables IRQ right after scheduler starts

Message ID 1738629964-11977-3-git-send-email-kazuhiro3.hayashi@toshiba.co.jp
State New
Headers show
Series Fix repeated WARNING in unpin_current_cpu() | expand

Commit Message

Kazuhiro Hayashi Feb. 4, 2025, 12:46 a.m. UTC
This patch resolves problem in 4.4 & 4.9 PREEMPT_RT kernels that
the following WARNING happens repeatedly due to broken context
caused by running slab allocation with IRQ disabled by mistake.

WARNING: CPU: * PID: ** at */kernel/cpu.c:197 unpin_current_cpu+0x60/0x70()

The system is almost unresponsive and the boot stalls once it occurs.
This repeated WARNING only happens while kernel is booting
(before reaches the userland) with a quite low reproducibility:
Only one time in around 1,000 ~ 10,000 reboots.

[Problem details]

On PREEMPT_RT kernels < v4.14-rt, after __slab_alloc() disables
IRQ with local_irq_save(), allocate_slab() is responsible for
re-enabling IRQ only under the specific conditions:

(1) gfpflags_allow_blocking(flags) OR
(2) system_state == SYSTEM_RUNNING

The problem happens when (1) is false AND system_state == SYSTEM_BOOTING,
caused by the following scenario:

1. Some kernel codes invokes the allocator without __GFP_DIRECT_RECLAIM
   bit (i.e. blocking not allowed) while SYSTEM_BOOTING
2. allocate_slab() calls the following functions with IRQ disabled
3. buffered_rmqueue() invokes local_[spin_]lock_irqsave(pa_lock) which
   might call schedule() and enable IRQ, if it failed to get pa_lock
4. The migrate_disable counter, which is not intended to be updated with
   IRQs disabled, is accidentally updated after schedule() then
   migrate_enable() raises WARN_ON_ONCE(p->migrate_disable <= 0)
5. The unpin_current_cpu() WARNING is raised eventually because the
   refcount counter is linked to the migrate_disable counter

The behavior 2-5 above has been obsereved[1] using ftrace.
The condition (2) above intends to make the memory allocator fully
preemptible on PREEMPT_RT kernels[2], so the lock function in the
step 3 above should work if SYSTEM_RUNNING but not if SYSTEM_BOOTING.

[How this is resolved in newer RT kernels]

A patch series in the mainline (v4.13) introduces SYSTEM_SCHEDULING[3].
On top of this, v4.14-rt (6cec8467) changes the condition (2) above:

-	if (system_state == SYSTEM_RUNNING)
+	if (system_state > SYSTEM_BOOTING)

This avoids the problem by enabling IRQ after SYSTEM_SCHEULDING.
Thus, the conditions that allocate_slab() enables IRQ are like:

(2)system_state   v4.9-rt or before  v4.14-rt or later
SYSTEM_BOOTING        (1)==true          (1)==true
                          :                  :
                          :                  v
SYSTEM_SCHEDULING         : < Problem      Always
                          v < occurs here    |
SYSTEM_RUNNING          Always               |
                          |                  |
                          v                  v

[How this patch works]

An simple option would be to backport the series[3], which is possible
and has been verified[4]. However, that series pulls functional
changes like SYSTEM_SCHEDULING and adjustments for it,
early might_sleep() and smp_processor_id() supports, etc.
Therefore, this patch uses an extra (but not mainline) flag
"system_scheduling" provided by the prior patch instead of
introducing SYSTEM_SCHEDULING, then uses the same condition as
newer RT kernels in allocate_slab().

This patch also applies the fix in v5.4-rt (7adf5bc5) to care
SYSTEM_SUSPEND in the condition check.

[1] https://lore.kernel.org/all/TYCPR01MB11385E3CDF05544B63F7EF9C1E1622@TYCPR01MB11385.jpnprd01.prod.outlook.com/
[2] https://docs.kernel.org/locking/locktypes.html#raw-spinlock-t-on-rt
[3] https://lore.kernel.org/all/20170516184231.564888231@linutronix.de/T/
[4] https://lore.kernel.org/all/TYCPR01MB1138579CA7612B568BB880652E1272@TYCPR01MB11385.jpnprd01.prod.outlook.com/

Signed-off-by: Kazuhiro Hayashi <kazuhiro3.hayashi@toshiba.co.jp>
Reviewed-by: Pavel Machek <pavel@denx.de>
---
 mm/slub.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Sebastian Andrzej Siewior Feb. 4, 2025, 8:18 a.m. UTC | #1
On 2025-02-04 09:46:04 [+0900], Kazuhiro Hayashi wrote:
> On PREEMPT_RT kernels < v4.14-rt, after __slab_alloc() disables
> IRQ with local_irq_save(), allocate_slab() is responsible for
> re-enabling IRQ only under the specific conditions:
> 
> (1) gfpflags_allow_blocking(flags) OR
> (2) system_state == SYSTEM_RUNNING
>
> The problem happens when (1) is false AND system_state == SYSTEM_BOOTING,
> caused by the following scenario:
> 
> 1. Some kernel codes invokes the allocator without __GFP_DIRECT_RECLAIM
>    bit (i.e. blocking not allowed) while SYSTEM_BOOTING
> 2. allocate_slab() calls the following functions with IRQ disabled
> 3. buffered_rmqueue() invokes local_[spin_]lock_irqsave(pa_lock) which
>    might call schedule() and enable IRQ, if it failed to get pa_lock
> 4. The migrate_disable counter, which is not intended to be updated with
>    IRQs disabled, is accidentally updated after schedule() then
>    migrate_enable() raises WARN_ON_ONCE(p->migrate_disable <= 0)
> 5. The unpin_current_cpu() WARNING is raised eventually because the
>    refcount counter is linked to the migrate_disable counter

I think the problem is that system_state _is_ SYSTEM_BOOTING but
scheduling and/or an additional CPU is already enabled. This is why the
lock is contended. Otherwise you wouldn't get that far. 

> The behavior 2-5 above has been obsereved[1] using ftrace.
> The condition (2) above intends to make the memory allocator fully
> preemptible on PREEMPT_RT kernels[2], so the lock function in the
> step 3 above should work if SYSTEM_RUNNING but not if SYSTEM_BOOTING.
> 
> [How this is resolved in newer RT kernels]
> 
> A patch series in the mainline (v4.13) introduces SYSTEM_SCHEDULING[3].
> On top of this, v4.14-rt (6cec8467) changes the condition (2) above:
> 
> -	if (system_state == SYSTEM_RUNNING)
> +	if (system_state > SYSTEM_BOOTING)
> 
> This avoids the problem by enabling IRQ after SYSTEM_SCHEULDING.

Yes. Once scheduling is enabled any sleeping lock might be contended so
interrupts must be enabled. This is not done done unconditionally
because early boot code expects to run with disabled interrupts.
This "system_state == SYSTEM_RUNNING" looked like a sane state in
between. However, once the scheduler is up and running, interrupts in
the system are enabled and so the slub allocator needs to always enable
interrupts.

> Thus, the conditions that allocate_slab() enables IRQ are like:
> 
> (2)system_state   v4.9-rt or before  v4.14-rt or later
> SYSTEM_BOOTING        (1)==true          (1)==true
>                           :                  :
>                           :                  v
> SYSTEM_SCHEDULING         : < Problem      Always
>                           v < occurs here    |
> SYSTEM_RUNNING          Always               |
>                           |                  |
>                           v                  v
> 
> [How this patch works]
> 
> An simple option would be to backport the series[3], which is possible
> and has been verified[4]. However, that series pulls functional
> changes like SYSTEM_SCHEDULING and adjustments for it,
> early might_sleep() and smp_processor_id() supports, etc.
> Therefore, this patch uses an extra (but not mainline) flag
> "system_scheduling" provided by the prior patch instead of
> introducing SYSTEM_SCHEDULING, then uses the same condition as
> newer RT kernels in allocate_slab().

The proposal looks okay. However the verified upstream version not only
addresses your issue but also makes smp_processor_id() and might_sleep()
work in the early phase. I would prefer the upstream solution for those
two reasons.

Sebastian
Pavel Machek Feb. 5, 2025, 12:55 p.m. UTC | #2
Hi!

> > An simple option would be to backport the series[3], which is possible
> > and has been verified[4]. However, that series pulls functional
> > changes like SYSTEM_SCHEDULING and adjustments for it,
> > early might_sleep() and smp_processor_id() supports, etc.
> > Therefore, this patch uses an extra (but not mainline) flag
> > "system_scheduling" provided by the prior patch instead of
> > introducing SYSTEM_SCHEDULING, then uses the same condition as
> > newer RT kernels in allocate_slab().
> 
> The proposal looks okay. However the verified upstream version not only
> addresses your issue but also makes smp_processor_id() and might_sleep()
> work in the early phase. I would prefer the upstream solution for those
> two reasons.

So... first, thanks for review. This is mostly my fault, Kazuhiro
Hayashi did full backport, but I asked for minimal version.

AFAICT there's no place for stable-rt to apply this, so we'll just be
taking this to our trees in CIP project. (And I prefer smaller
version. Additional checking is nice for development but not so nice
this late in development cycle).

Best regards,
								Pavel
Kazuhiro Hayashi Feb. 10, 2025, 7:20 a.m. UTC | #3
Hello,

Thank you for your comment!

> > -	if (system_state == SYSTEM_RUNNING)
> > +	if (system_state > SYSTEM_BOOTING)
> >
> > This avoids the problem by enabling IRQ after SYSTEM_SCHEULDING.
> 
> Yes. Once scheduling is enabled any sleeping lock might be contended so
> interrupts must be enabled. This is not done done unconditionally
> because early boot code expects to run with disabled interrupts.
> This "system_state == SYSTEM_RUNNING" looked like a sane state in
> between. However, once the scheduler is up and running, interrupts in
> the system are enabled and so the slub allocator needs to always enable
> interrupts.

Appreciate sharing the background.
This is helpful to understand what was happening around v4.14-rt.

> The proposal looks okay. However the verified upstream version not only
> addresses your issue but also makes smp_processor_id() and might_sleep()
> work in the early phase. I would prefer the upstream solution for those
> two reasons.

Thanks for your opinion.
Just for your reference, I've shared another (revised) patch series[1]
which is based on the upstream series[2]. It has been tested on
several environments similarly to the draft version[3].

Yes, the current proposal only focuses on resolving the WARNING issue.
At the same time, I think it would be also possible to apply the other
improvements for smp_processor_id() and might_sleep(), which means
1/17 and 17/17 of the mainline patch series[2], on top of the current way
i.e. using the custom flag system_scheduling.

The question here is whether to insert SYSTEM_SCHEDULING is acceptable
or not in LTS phase. This addition would change behavior of future fixes
or out-of-tree codes that check system_state, so adjustments similar to
the mainline series[2] (2/17 ~ 15/17) are needed for them.
Can I ask if it's recommended to introduce SYSTEM_SCHEDULING even in LTS?
It's v4.4-rt specific topic but I think similar discussion is regularly
happening in newer -rt branches and LTS branches where PREEMPT_RT is merged.

[1] https://lore.kernel.org/linux-rt-devel/1739170545-25011-1-git-send-email-kazuhiro3.hayashi@toshiba.co.jp/T/
[2] https://lore.kernel.org/all/20170516184231.564888231@linutronix.de/T/
[2] https://lore.kernel.org/all/TYCPR01MB1138579CA7612B568BB880652E1272@TYCPR01MB11385.jpnprd01.prod.outlook.com/

Best regards,
Kazu
Sebastian Andrzej Siewior Feb. 10, 2025, 10:18 a.m. UTC | #4
On 2025-02-10 07:20:35 [+0000], kazuhiro3.hayashi@toshiba.co.jp wrote:
> Hello,
Hi,

> The question here is whether to insert SYSTEM_SCHEDULING is acceptable
> or not in LTS phase. This addition would change behavior of future fixes
> or out-of-tree codes that check system_state, so adjustments similar to
> the mainline series[2] (2/17 ~ 15/17) are needed for them.
> Can I ask if it's recommended to introduce SYSTEM_SCHEDULING even in LTS?
> It's v4.4-rt specific topic but I think similar discussion is regularly
> happening in newer -rt branches and LTS branches where PREEMPT_RT is merged.

In general I try to have the same code if it is easily possible. A few
examples where this was not the case:
- The printk code, as work is in progress, was never fully backported.
  The changes between kernel versions were small and there was little to
  none user visible changes. However changes under the hood were usually
  big so in general it was not worth the effort.

- scheduling wise, we went from PREEMPT_LAZY to PREEMPT_AUTO to
  LAZY_PREEMPT. Fixes within one implementation went all the way down
  but the implementation as a whole was never backported. PREEMPT_AUTO
  could be considered as half way done but nobody complained about
  something in particular. It was "recently" discussed whether or not to
  backport LAZY_PREEMPT to replace PREEMPT_AUTO in some of the lower
  kernels but the changes required to backport it would be huge because
  the scheduler in particular changed. So this is hard to justify.

What questions can you ask?
- Do you backport code that relies on `system_states' or did so in the
  past?
- Is more likely to backport code from before v4.13 or after? Either way
  you need to look at this.

> Best regards,
> Kazu

Sebastian
Kazuhiro Hayashi Feb. 12, 2025, 7:57 a.m. UTC | #5
Hi,

> > [...]
> > Can I ask if it's recommended to introduce SYSTEM_SCHEDULING even in LTS?
> > It's v4.4-rt specific topic but I think similar discussion is regularly
> > happening in newer -rt branches and LTS branches where PREEMPT_RT is merged.
> 
> In general I try to have the same code if it is easily possible. A few
> examples where this was not the case:
> - The printk code, as work is in progress, was never fully backported.
>   The changes between kernel versions were small and there was little to
>   none user visible changes. However changes under the hood were usually
>   big so in general it was not worth the effort.
> 
> - scheduling wise, we went from PREEMPT_LAZY to PREEMPT_AUTO to
>   LAZY_PREEMPT. Fixes within one implementation went all the way down
>   but the implementation as a whole was never backported. PREEMPT_AUTO
>   could be considered as half way done but nobody complained about
>   something in particular. It was "recently" discussed whether or not to
>   backport LAZY_PREEMPT to replace PREEMPT_AUTO in some of the lower
>   kernels but the changes required to backport it would be huge because
>   the scheduler in particular changed. So this is hard to justify.

Thank you for the examples!
In the both cases, it seems changes for backports are big and they affect
other backports which will happen in the future.
Comparing to those examples, I understood the SYSTEM_SCHEDULING case as
the one that backport should be considered and tried.

> What questions can you ask?

As it's clear that to backport the series[1] including SYSTEM_SCHEDULING
is preferred based on the interpretation above, I have no further query now.

> - Do you backport code that relies on `system_states' or did so in the
>   past?

I have not tried to backport code (1/17 & 17/17) that relies on
`system_scheduling` (custom variable) yet. No plan to do so anymore
as long as we can select to backport the series[1].

I have backported the full series[1] that relies on `SYSTEM_SCHEDULING`
(used as a value of `system_state` in upstream). It has been shared as [2].
CIP (4.4 RT) will decide if this backported series will be applied
based on all the feedbacks here.

> - Is more likely to backport code from before v4.13 or after? Either way
>   you need to look at this.

It's more likely to backport code from v4.13 that introduce the series[1].
The reason is IRQ controls in slub.c are changed significantly from v5.15-rt
and we concluded that it's hard to backport the changes into v4.4-rt[3].
Another mainline series[4] replaces most IRQ enable/disable codes in
slub.c with local_lock_irqsave() and local_unlock_irqrestore().
Eventually, `system_state` is no longer evaluated in SLAB functions at all
like the previous __slab_alloc().


Pavel,
Feel free to append information if you need.
Can we select the (revised) full backport[2] or anything else?
I'd like to know we are in the same page.

[1] https://lore.kernel.org/all/20170516184231.564888231@linutronix.de/T/
[2] https://lore.kernel.org/linux-rt-devel/1739170545-25011-1-git-send-email-kazuhiro3.hayashi@toshiba.co.jp/T/
[3] https://lore.kernel.org/all/TYCPR01MB11385C4C5CA765090E66944FFE1052@TYCPR01MB11385.jpnprd01.prod.outlook.com/
[4] https://lore.kernel.org/lkml/20210904105003.11688-1-vbabka@suse.cz/

Best regards,
Kazu
diff mbox series

Patch

diff --git a/mm/slub.c b/mm/slub.c
index fd23ff951395..6186a2586289 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1412,7 +1412,8 @@  static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 	if (gfpflags_allow_blocking(flags))
 		enableirqs = true;
 #ifdef CONFIG_PREEMPT_RT_FULL
-	if (system_state == SYSTEM_RUNNING)
+	/* SYSTEM_SCHEDULING <= system_state < SYSTEM_SUSPEND in the mainline */
+	if (system_scheduling && system_state < SYSTEM_SUSPEND)
 		enableirqs = true;
 #endif
 	if (enableirqs)