diff mbox series

[v3,2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead()

Message ID 20241108122909.763663-3-patryk.wlazlyn@linux.intel.com
State New
Headers show
Series SRF: Fix offline CPU preventing pc6 entry | expand

Commit Message

Patryk Wlazlyn Nov. 8, 2024, 12:29 p.m. UTC
The generic implementation, based on cpuid leaf 0x5, for looking up the
mwait hint for the deepest cstate, depends on them to be continuous in
range [0, NUM_SUBSTATES-1]. While that is correct on most Intel x86
platforms, it is not architectural and may not result in reaching the
most optimized idle state on some of them.

Prefer cpuidle_play_dead() over the generic mwait_play_dead() loop and
fallback to the later in case of missing enter_dead() handler.

Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>
---
 arch/x86/kernel/smpboot.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Gautham R. Shenoy Nov. 13, 2024, 11:41 a.m. UTC | #1
On Tue, Nov 12, 2024 at 03:56:18PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 12, 2024 at 02:23:14PM +0100, Rafael J. Wysocki wrote:
> > On Tue, Nov 12, 2024 at 12:47 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Fri, Nov 08, 2024 at 01:29:08PM +0100, Patryk Wlazlyn wrote:
> > > > The generic implementation, based on cpuid leaf 0x5, for looking up the
> > > > mwait hint for the deepest cstate, depends on them to be continuous in
> > > > range [0, NUM_SUBSTATES-1]. While that is correct on most Intel x86
> > > > platforms, it is not architectural and may not result in reaching the
> > > > most optimized idle state on some of them.
> > > >
> > > > Prefer cpuidle_play_dead() over the generic mwait_play_dead() loop and
> > > > fallback to the later in case of missing enter_dead() handler.
> > > >
> > > > Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>
> > > > ---
> > > >  arch/x86/kernel/smpboot.c | 4 ++--
> > > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> > > > index 44c40781bad6..721bb931181c 100644
> > > > --- a/arch/x86/kernel/smpboot.c
> > > > +++ b/arch/x86/kernel/smpboot.c
> > > > @@ -1416,9 +1416,9 @@ void native_play_dead(void)
> > > >       play_dead_common();
> > > >       tboot_shutdown(TB_SHUTDOWN_WFS);
> > > >
> > > > -     mwait_play_dead();
> > > >       if (cpuidle_play_dead())
> > > > -             hlt_play_dead();
> > > > +             mwait_play_dead();
> > > > +     hlt_play_dead();
> > > >  }
> > >
> > > Yeah, I don't think so. we don't want to accidentally hit
> > > acpi_idle_play_dead().
> > 
> > Having inspected the code once again, I'm not sure what your concern is.
> > 
> > :enter.dead() is set to acpi_idle_play_dead() for all states in ACPI
> > idle - see acpi_processor_setup_cstates() and the role of the type
> > check is to filter out bogus table entries (the "type" must be 1, 2,
> > or 3 as per the spec).
> > 
> > Then cpuidle_play_dead() calls drv->states[i].enter_dead() for the
> > deepest state where it is set and if this is FFH,
> > acpi_idle_play_dead() will return an error.  So after the change, the
> > code above will fall back to mwait_play_dead() then.
> > 
> > Or am I missing anything?
> 
> So it relies on there being a C2/C3 state enumerated and that being FFh.
> Otherwise it will find a 'working' state and we're up a creek.
> 
> Typically I expect C2/C3 FFh states will be there on Intel stuff, but it
> seems awefully random to rely on this hole. AMD might unwittinly change
> the ACPI driver (they're the main user) and then we'd be up a creek.

AMD platforms won't be using FFH based states for offlined CPUs. We
prefer IO based states when available, and HLT otherwise.

> 
> Robustly we'd teach the ACPI driver about FFh and set enter_dead on
> every state -- but we'd have to double check that with AMD.

Works for us as long as those FFh states aren't used for play_dead on
AMD platforms.

How about something like this (completely untested)

---------------------x8----------------------------------------------------

diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
index f3ffd0a3a012..bd611771fa6c 100644
--- a/arch/x86/kernel/acpi/cstate.c
+++ b/arch/x86/kernel/acpi/cstate.c
@@ -215,6 +215,24 @@ void __cpuidle acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
 }
 EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_enter);
 
+static int acpi_processor_ffh_play_dead(struct acpi_processor_cx *cx)
+{
+       unsigned int cpu = smp_processor_id();
+       struct cstate_entry *percpu_entry;
+
+       /*
+        * This is ugly. But AMD processors don't prefer MWAIT based
+        * C-states when processors are offlined.
+        */
+       if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
+           boot_cpu_data.x86_vendor == X86_VENDOR_HYGON)
+               return -ENODEV;
+
+       percpu_entry = per_cpu_ptr(cpu_cstate_entry, cpu);
+       return mwait_play_dead_with_hints(percpu_entry->states[cx->index].eax);
+}
+EXPORT_SYMBOL_GPL(acpi_processor_ffh_play_dead);
+
 static int __init ffh_cstate_init(void)
 {
        struct cpuinfo_x86 *c = &boot_cpu_data;
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 831fa4a12159..c535f5df9081 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -590,6 +590,8 @@ static int acpi_idle_play_dead(struct cpuidle_device *dev, int index)
                        raw_safe_halt();
                else if (cx->entry_method == ACPI_CSTATE_SYSTEMIO) {
                        io_idle(cx->address);
+               } else if (cx->entry_method == ACPI_CSTATE_FFH) {
+                       return acpi_procesor_ffh_play_dead(cx);
                } else
                        return -ENODEV;
        }
diff --git a/include/acpi/processor.h b/include/acpi/processor.h
index e6f6074eadbf..38329dcdd2b9 100644
--- a/include/acpi/processor.h
+++ b/include/acpi/processor.h
@@ -280,6 +280,7 @@ int acpi_processor_ffh_cstate_probe(unsigned int cpu,
                                    struct acpi_processor_cx *cx,
                                    struct acpi_power_register *reg);
 void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cstate);
+int acpi_processor_ffh_dead(struct acpi_processor_cx *cstate);
 #else
 static inline void acpi_processor_power_init_bm_check(struct
                                                      acpi_processor_flags
@@ -300,6 +301,11 @@ static inline void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx
 {
        return;
 }
+
+static inline acpi_processor_ffh_dead(struct acpi_processor_cx *cstate)
+{
+       return -ENODEV;
+}
 #endif
 
 static inline int call_on_cpu(int cpu, long (*fn)(void *), void *arg,

---------------------x8--------------------------------------------------------

--
Thanks and Regards
gautham.
Dave Hansen Nov. 13, 2024, 4:14 p.m. UTC | #2
On 11/13/24 03:41, Gautham R. Shenoy wrote:
> +       /*
> +        * This is ugly. But AMD processors don't prefer MWAIT based
> +        * C-states when processors are offlined.
> +        */
> +       if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
> +           boot_cpu_data.x86_vendor == X86_VENDOR_HYGON)
> +               return -ENODEV;

Can we get an X86_FEATURE for this, please?  Either a positive one:

	X86_FEATURE_MWAIT_OK_FOR_OFFLINE

or a negative one:

	X86_FEATURE_MWAIT_BUSTED_FOR_OFFLINE

... with better names.

Or even a helper.  Because if you add this AMD||HYGON check, it'll be at
_least_ the second one of these for the same logical reason.
Peter Zijlstra Nov. 13, 2024, 4:22 p.m. UTC | #3
On Wed, Nov 13, 2024 at 05:11:38PM +0530, Gautham R. Shenoy wrote:

> How about something like this (completely untested)
> 
> ---------------------x8----------------------------------------------------
> 
> diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
> index f3ffd0a3a012..bd611771fa6c 100644
> --- a/arch/x86/kernel/acpi/cstate.c
> +++ b/arch/x86/kernel/acpi/cstate.c
> @@ -215,6 +215,24 @@ void __cpuidle acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
>  }
>  EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_enter);
>  
> +static int acpi_processor_ffh_play_dead(struct acpi_processor_cx *cx)
> +{
> +       unsigned int cpu = smp_processor_id();
> +       struct cstate_entry *percpu_entry;
> +
> +       /*
> +        * This is ugly. But AMD processors don't prefer MWAIT based
> +        * C-states when processors are offlined.
> +        */
> +       if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
> +           boot_cpu_data.x86_vendor == X86_VENDOR_HYGON)
> +               return -ENODEV;

Are there AMD systems with FFh idle states at all?

Also, I don't think this works right, the loop in cpuidle_play_dead()
will terminate on this, and not try a shallower state which might have
IO/HLT on.

> +
> +       percpu_entry = per_cpu_ptr(cpu_cstate_entry, cpu);
> +       return mwait_play_dead_with_hints(percpu_entry->states[cx->index].eax);
> +}
> +EXPORT_SYMBOL_GPL(acpi_processor_ffh_play_dead);
Wysocki, Rafael J Nov. 13, 2024, 4:27 p.m. UTC | #4
On 11/13/2024 5:22 PM, Peter Zijlstra wrote:
> On Wed, Nov 13, 2024 at 05:11:38PM +0530, Gautham R. Shenoy wrote:
>
>> How about something like this (completely untested)
>>
>> ---------------------x8----------------------------------------------------
>>
>> diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
>> index f3ffd0a3a012..bd611771fa6c 100644
>> --- a/arch/x86/kernel/acpi/cstate.c
>> +++ b/arch/x86/kernel/acpi/cstate.c
>> @@ -215,6 +215,24 @@ void __cpuidle acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
>>   }
>>   EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_enter);
>>   
>> +static int acpi_processor_ffh_play_dead(struct acpi_processor_cx *cx)
>> +{
>> +       unsigned int cpu = smp_processor_id();
>> +       struct cstate_entry *percpu_entry;
>> +
>> +       /*
>> +        * This is ugly. But AMD processors don't prefer MWAIT based
>> +        * C-states when processors are offlined.
>> +        */
>> +       if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
>> +           boot_cpu_data.x86_vendor == X86_VENDOR_HYGON)
>> +               return -ENODEV;
> Are there AMD systems with FFh idle states at all?

I don't know.


> Also, I don't think this works right, the loop in cpuidle_play_dead()
> will terminate on this, and not try a shallower state which might have
> IO/HLT on.

I think you are right.


>> +
>> +       percpu_entry = per_cpu_ptr(cpu_cstate_entry, cpu);
>> +       return mwait_play_dead_with_hints(percpu_entry->states[cx->index].eax);
>> +}
>> +EXPORT_SYMBOL_GPL(acpi_processor_ffh_play_dead);
Gautham R. Shenoy Nov. 14, 2024, 5:06 a.m. UTC | #5
Hello Dave,
On Wed, Nov 13, 2024 at 08:14:08AM -0800, Dave Hansen wrote:
> On 11/13/24 03:41, Gautham R. Shenoy wrote:
> > +       /*
> > +        * This is ugly. But AMD processors don't prefer MWAIT based
> > +        * C-states when processors are offlined.
> > +        */
> > +       if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
> > +           boot_cpu_data.x86_vendor == X86_VENDOR_HYGON)
> > +               return -ENODEV;
> 
> Can we get an X86_FEATURE for this, please?  Either a positive one:
> 
> 	X86_FEATURE_MWAIT_OK_FOR_OFFLINE
> 
> or a negative one:
> 
> 	X86_FEATURE_MWAIT_BUSTED_FOR_OFFLINE
>

Sure. That makes sense.

> ... with better names.
> 
> Or even a helper.  Because if you add this AMD||HYGON check, it'll be at
> _least_ the second one of these for the same logical reason.

Fair enough. Will add that.

--
Thanks and Regards
gautham.
Peter Zijlstra Nov. 14, 2024, 11:58 a.m. UTC | #6
On Wed, Nov 13, 2024 at 05:11:38PM +0530, Gautham R. Shenoy wrote:

> AMD platforms won't be using FFH based states for offlined CPUs. We
> prefer IO based states when available, and HLT otherwise.
> 
> > 
> > Robustly we'd teach the ACPI driver about FFh and set enter_dead on
> > every state -- but we'd have to double check that with AMD.
> 
> Works for us as long as those FFh states aren't used for play_dead on
> AMD platforms.

AFAIU AMD doesn't want to use MWAIT -- ever, not only for offline.
Confirm?

But if it were to use MWAIT for regular idle, then surely it's OK for
offline too, right?
Rafael J. Wysocki Nov. 14, 2024, 11:58 a.m. UTC | #7
On Wed, Nov 13, 2024 at 5:27 PM Wysocki, Rafael J
<rafael.j.wysocki@intel.com> wrote:
>
>
> On 11/13/2024 5:22 PM, Peter Zijlstra wrote:
> > On Wed, Nov 13, 2024 at 05:11:38PM +0530, Gautham R. Shenoy wrote:
> >
> >> How about something like this (completely untested)
> >>
> >> ---------------------x8----------------------------------------------------
> >>
> >> diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
> >> index f3ffd0a3a012..bd611771fa6c 100644
> >> --- a/arch/x86/kernel/acpi/cstate.c
> >> +++ b/arch/x86/kernel/acpi/cstate.c
> >> @@ -215,6 +215,24 @@ void __cpuidle acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
> >>   }
> >>   EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_enter);
> >>
> >> +static int acpi_processor_ffh_play_dead(struct acpi_processor_cx *cx)
> >> +{
> >> +       unsigned int cpu = smp_processor_id();
> >> +       struct cstate_entry *percpu_entry;
> >> +
> >> +       /*
> >> +        * This is ugly. But AMD processors don't prefer MWAIT based
> >> +        * C-states when processors are offlined.
> >> +        */
> >> +       if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
> >> +           boot_cpu_data.x86_vendor == X86_VENDOR_HYGON)
> >> +               return -ENODEV;
> > Are there AMD systems with FFh idle states at all?
>
> I don't know.
>
>
> > Also, I don't think this works right, the loop in cpuidle_play_dead()
> > will terminate on this, and not try a shallower state which might have
> > IO/HLT on.
>
> I think you are right.

AFAICS, cpuidle_play_dead() needs to be reworked to not bail out after
receiving an error from :play_dead() for one state, but only after all
of them have failed.

> >> +
> >> +       percpu_entry = per_cpu_ptr(cpu_cstate_entry, cpu);
> >> +       return mwait_play_dead_with_hints(percpu_entry->states[cx->index].eax);
> >> +}
> >> +EXPORT_SYMBOL_GPL(acpi_processor_ffh_play_dead);
Peter Zijlstra Nov. 14, 2024, 12:03 p.m. UTC | #8
On Tue, Nov 12, 2024 at 03:01:27PM +0100, Peter Zijlstra wrote:

> No, not mwait hint. We need an instruction that:
> 
>  - goes to deepest C state
>  - drops into WAIT-for-Start-IPI (SIPI)
> 
> Notably, it should not wake from:
> 
>  - random memory writes
>  - NMI, MCE, SMI and other such non-maskable thingies
>  - anything else -- the memory pointed to by RIP might no longer exist
> 
> Lets call the instruction: DEAD.

So, turns out that when you send INIT to an AP it does the whole drop
into Wait-for-SIPI and ignore non-maskable crap.

The reason we don't do that is because INIT to CPU0 (BP) is somewhat
fatal, but since Thomas killed all that CPU0 hotplug crap, I think we
can actually go do that.
Peter Zijlstra Nov. 14, 2024, 12:17 p.m. UTC | #9
On Thu, Nov 14, 2024 at 12:58:49PM +0100, Rafael J. Wysocki wrote:
> On Wed, Nov 13, 2024 at 5:27 PM Wysocki, Rafael J
> <rafael.j.wysocki@intel.com> wrote:
> >
> >
> > On 11/13/2024 5:22 PM, Peter Zijlstra wrote:
> > > On Wed, Nov 13, 2024 at 05:11:38PM +0530, Gautham R. Shenoy wrote:
> > >
> > >> How about something like this (completely untested)
> > >>
> > >> ---------------------x8----------------------------------------------------
> > >>
> > >> diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
> > >> index f3ffd0a3a012..bd611771fa6c 100644
> > >> --- a/arch/x86/kernel/acpi/cstate.c
> > >> +++ b/arch/x86/kernel/acpi/cstate.c
> > >> @@ -215,6 +215,24 @@ void __cpuidle acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
> > >>   }
> > >>   EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_enter);
> > >>
> > >> +static int acpi_processor_ffh_play_dead(struct acpi_processor_cx *cx)
> > >> +{
> > >> +       unsigned int cpu = smp_processor_id();
> > >> +       struct cstate_entry *percpu_entry;
> > >> +
> > >> +       /*
> > >> +        * This is ugly. But AMD processors don't prefer MWAIT based
> > >> +        * C-states when processors are offlined.
> > >> +        */
> > >> +       if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
> > >> +           boot_cpu_data.x86_vendor == X86_VENDOR_HYGON)
> > >> +               return -ENODEV;
> > > Are there AMD systems with FFh idle states at all?
> >
> > I don't know.
> >
> >
> > > Also, I don't think this works right, the loop in cpuidle_play_dead()
> > > will terminate on this, and not try a shallower state which might have
> > > IO/HLT on.
> >
> > I think you are right.
> 
> AFAICS, cpuidle_play_dead() needs to be reworked to not bail out after
> receiving an error from :play_dead() for one state, but only after all
> of them have failed.

That and ideally we remove that whole ACPI_STATE_C[123] filter on
setting enter_dead. I don't see how that makes sense.
Gautham R. Shenoy Nov. 14, 2024, 5:24 p.m. UTC | #10
Hello Peter,

On Thu, Nov 14, 2024 at 12:58:31PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 13, 2024 at 05:11:38PM +0530, Gautham R. Shenoy wrote:
> 
> > AMD platforms won't be using FFH based states for offlined CPUs. We
> > prefer IO based states when available, and HLT otherwise.
> > 
> > > 
> > > Robustly we'd teach the ACPI driver about FFh and set enter_dead on
> > > every state -- but we'd have to double check that with AMD.
> > 
> > Works for us as long as those FFh states aren't used for play_dead on
> > AMD platforms.
> 
> AFAIU AMD doesn't want to use MWAIT -- ever, not only for offline.
> Confirm?
> 

AMD wants to use MWAIT for cpuidle and it does use MWAIT based C1
state on both client and server parts.

Eg: On my server box

$ cpupower idle-info  | grep "FFH" -B1 -A3
C1:
Flags/Description: ACPI FFH MWAIT 0x0
Latency: 1
Usage: 6591
Duration: 1482606

> But if it were to use MWAIT for regular idle, then surely it's OK for
> offline too, right?

I tried this out today and there is no functional issue.

However, I would like to run some experiments on whether HLT provides
better power savings than MWAIT C1 with CPUs offlined. I will get back
with this information tomorrow.

--
Thanks and Regards
gautham.
Gautham R. Shenoy Nov. 14, 2024, 5:36 p.m. UTC | #11
On Wed, Nov 13, 2024 at 05:22:22PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 13, 2024 at 05:11:38PM +0530, Gautham R. Shenoy wrote:
> 
> > How about something like this (completely untested)
> > 
> > ---------------------x8----------------------------------------------------
> > 
> > diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
> > index f3ffd0a3a012..bd611771fa6c 100644
> > --- a/arch/x86/kernel/acpi/cstate.c
> > +++ b/arch/x86/kernel/acpi/cstate.c
> > @@ -215,6 +215,24 @@ void __cpuidle acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
> >  }
> >  EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_enter);
> >  
> > +static int acpi_processor_ffh_play_dead(struct acpi_processor_cx *cx)
> > +{
> > +       unsigned int cpu = smp_processor_id();
> > +       struct cstate_entry *percpu_entry;
> > +
> > +       /*
> > +        * This is ugly. But AMD processors don't prefer MWAIT based
> > +        * C-states when processors are offlined.
> > +        */
> > +       if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
> > +           boot_cpu_data.x86_vendor == X86_VENDOR_HYGON)
> > +               return -ENODEV;
> 
> Are there AMD systems with FFh idle states at all?

Yes. On AMD systems, the ACPI C1 state is an FFh idle state. An
example from my server box.


$ cpupower idle-info  | grep "FFH" -B1 -A3
C1:
Flags/Description: ACPI FFH MWAIT 0x0
Latency: 1
Usage: 6591
Duration: 1482606

> 
> Also, I don't think this works right, the loop in cpuidle_play_dead()
> will terminate on this, and not try a shallower state which might have
> IO/HLT on.

Ah yes. You are right. I didn't observe that cpuidle_play_dead() calls
"return idle_state.enter_state". I suppose the solution would be to
not populate .enter_dead callback for FFH based C-States on AMD/Hygon
Platforms.

How about something like the following? I have tested this on AMD
servers by disabling the IO based C2 state, and I could observe that
the offlined CPUs entered HLT bypassing the MWAIT based C1. When IO
based C2 state is enabled, offlined CPUs enter the C2 state as before.
If Global C-States are disabled, then, offlined CPUs enter HLT.

This is on top of Patrick's series.

I have defined X86_FEATURE_NO_MWAIT_OFFLINE as suggested by Dave to
prevent MWAIT based C-states being used for CPU Offline on AMD and
Hygon.

---------------------x8----------------------------------------------------
From: "Gautham R. Shenoy" <gautham.shenoy@amd.com>
Date: Thu, 14 Nov 2024 14:48:17 +0530
Subject: [PATCH] acpi_idle: Teach acpi_idle_play_dead about FFH states

Signed-off-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
---
 arch/x86/include/asm/cpufeatures.h |  1 +
 arch/x86/kernel/acpi/cstate.c      | 14 ++++++++++++++
 arch/x86/kernel/smpboot.c          |  6 +++---
 drivers/acpi/processor_idle.c      | 24 ++++++++++++++++++++++--
 include/acpi/processor.h           |  6 ++++++
 5 files changed, 46 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 05e985ce9880..ceb002406d0c 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -236,6 +236,7 @@
 #define X86_FEATURE_PVUNLOCK		( 8*32+20) /* PV unlock function */
 #define X86_FEATURE_VCPUPREEMPT		( 8*32+21) /* PV vcpu_is_preempted function */
 #define X86_FEATURE_TDX_GUEST		( 8*32+22) /* "tdx_guest" Intel Trust Domain Extensions Guest */
+#define X86_FEATURE_NO_MWAIT_OFFLINE    ( 8*32+23) /* Don't use MWAIT states for offlined CPUs*/
 
 /* Intel-defined CPU features, CPUID level 0x00000007:0 (EBX), word 9 */
 #define X86_FEATURE_FSGSBASE		( 9*32+ 0) /* "fsgsbase" RDFSBASE, WRFSBASE, RDGSBASE, WRGSBASE instructions*/
diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
index f3ffd0a3a012..a5255a603745 100644
--- a/arch/x86/kernel/acpi/cstate.c
+++ b/arch/x86/kernel/acpi/cstate.c
@@ -215,6 +215,16 @@ void __cpuidle acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
 }
 EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_enter);
 
+void acpi_processor_ffh_play_dead(struct acpi_processor_cx *cx)
+{
+	unsigned int cpu = smp_processor_id();
+	struct cstate_entry *percpu_entry;
+
+	percpu_entry = per_cpu_ptr(cpu_cstate_entry, cpu);
+	mwait_play_dead_with_hint(percpu_entry->states[cx->index].eax);
+}
+EXPORT_SYMBOL_GPL(acpi_processor_ffh_play_dead);
+
 static int __init ffh_cstate_init(void)
 {
 	struct cpuinfo_x86 *c = &boot_cpu_data;
@@ -224,6 +234,10 @@ static int __init ffh_cstate_init(void)
 	    c->x86_vendor != X86_VENDOR_HYGON)
 		return -1;
 
+	if (c->x86_vendor == X86_VENDOR_AMD ||
+	    c->x86_vendor == X86_VENDOR_HYGON)
+		set_cpu_cap(c, X86_FEATURE_NO_MWAIT_OFFLINE);
+
 	cpu_cstate_entry = alloc_percpu(struct cstate_entry);
 	return 0;
 }
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 106f26e976b8..b5939bf96be6 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1285,9 +1285,9 @@ static inline int mwait_play_dead(void)
 	unsigned int highest_subcstate = 0;
 	int i;
 
-	if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
-	    boot_cpu_data.x86_vendor == X86_VENDOR_HYGON)
-		return 1;
+	if (boot_cpu_has(X86_FEATURE_NO_MWAIT_OFFLINE))
+		return -ENODEV;
+
 	if (!this_cpu_has(X86_FEATURE_MWAIT))
 		return 1;
 	if (!this_cpu_has(X86_FEATURE_CLFLUSH))
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 831fa4a12159..1e2259b4395b 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -590,6 +590,8 @@ static int acpi_idle_play_dead(struct cpuidle_device *dev, int index)
 			raw_safe_halt();
 		else if (cx->entry_method == ACPI_CSTATE_SYSTEMIO) {
 			io_idle(cx->address);
+		} else if (cx->entry_method == ACPI_CSTATE_FFH) {
+			acpi_processor_ffh_play_dead(cx);
 		} else
 			return -ENODEV;
 	}
@@ -772,6 +774,25 @@ static int acpi_processor_setup_cpuidle_cx(struct acpi_processor *pr,
 	return 0;
 }
 
+static inline bool valid_acpi_idle_type(struct acpi_processor_cx *cx)
+{
+	if (cx->type == ACPI_STATE_C1 ||
+	    cx->type == ACPI_STATE_C2 ||
+	    cx->type == ACPI_STATE_C3)
+		return true;
+
+	return false;
+}
+
+static inline bool acpi_idle_can_play_dead(struct acpi_processor_cx *cx)
+{
+	if (cx->entry_method == ACPI_CSTATE_FFH &&
+	    boot_cpu_has(X86_FEATURE_NO_MWAIT_OFFLINE))
+		return false;
+
+	return true;
+}
+
 static int acpi_processor_setup_cstates(struct acpi_processor *pr)
 {
 	int i, count;
@@ -803,8 +824,7 @@ static int acpi_processor_setup_cstates(struct acpi_processor *pr)
 		state->enter = acpi_idle_enter;
 
 		state->flags = 0;
-		if (cx->type == ACPI_STATE_C1 || cx->type == ACPI_STATE_C2 ||
-		    cx->type == ACPI_STATE_C3) {
+		if (valid_acpi_idle_type(cx) && acpi_idle_can_play_dead(cx)) {
 			state->enter_dead = acpi_idle_play_dead;
 			if (cx->type != ACPI_STATE_C3)
 				drv->safe_state_index = count;
diff --git a/include/acpi/processor.h b/include/acpi/processor.h
index e6f6074eadbf..349fe47a579e 100644
--- a/include/acpi/processor.h
+++ b/include/acpi/processor.h
@@ -280,6 +280,7 @@ int acpi_processor_ffh_cstate_probe(unsigned int cpu,
 				    struct acpi_processor_cx *cx,
 				    struct acpi_power_register *reg);
 void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cstate);
+void acpi_processor_ffh_play_dead(struct acpi_processor_cx *cstate);
 #else
 static inline void acpi_processor_power_init_bm_check(struct
 						      acpi_processor_flags
@@ -300,6 +301,11 @@ static inline void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx
 {
 	return;
 }
+
+static inline void acpi_processor_ffh_play_dead(struct acpi_processor_cx *cstate)
+{
+	return;
+}
 #endif
 
 static inline int call_on_cpu(int cpu, long (*fn)(void *), void *arg,


--
Thanks and Regards
gautham.
Rafael J. Wysocki Nov. 14, 2024, 5:58 p.m. UTC | #12
On Thu, Nov 14, 2024 at 6:36 PM Gautham R. Shenoy
<gautham.shenoy@amd.com> wrote:
>
> On Wed, Nov 13, 2024 at 05:22:22PM +0100, Peter Zijlstra wrote:
> > On Wed, Nov 13, 2024 at 05:11:38PM +0530, Gautham R. Shenoy wrote:
> >
> > > How about something like this (completely untested)
> > >
> > > ---------------------x8----------------------------------------------------
> > >
> > > diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
> > > index f3ffd0a3a012..bd611771fa6c 100644
> > > --- a/arch/x86/kernel/acpi/cstate.c
> > > +++ b/arch/x86/kernel/acpi/cstate.c
> > > @@ -215,6 +215,24 @@ void __cpuidle acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
> > >  }
> > >  EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_enter);
> > >
> > > +static int acpi_processor_ffh_play_dead(struct acpi_processor_cx *cx)
> > > +{
> > > +       unsigned int cpu = smp_processor_id();
> > > +       struct cstate_entry *percpu_entry;
> > > +
> > > +       /*
> > > +        * This is ugly. But AMD processors don't prefer MWAIT based
> > > +        * C-states when processors are offlined.
> > > +        */
> > > +       if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
> > > +           boot_cpu_data.x86_vendor == X86_VENDOR_HYGON)
> > > +               return -ENODEV;
> >
> > Are there AMD systems with FFh idle states at all?
>
> Yes. On AMD systems, the ACPI C1 state is an FFh idle state. An
> example from my server box.
>
>
> $ cpupower idle-info  | grep "FFH" -B1 -A3
> C1:
> Flags/Description: ACPI FFH MWAIT 0x0
> Latency: 1
> Usage: 6591
> Duration: 1482606
>
> >
> > Also, I don't think this works right, the loop in cpuidle_play_dead()
> > will terminate on this, and not try a shallower state which might have
> > IO/HLT on.
>
> Ah yes. You are right. I didn't observe that cpuidle_play_dead() calls
> "return idle_state.enter_state". I suppose the solution would be to
> not populate .enter_dead callback for FFH based C-States on AMD/Hygon
> Platforms.

No, no, the solution is to change cpuidle_play_dead() to do the right thing.

Please see

https://lore.kernel.org/linux-pm/4992010.31r3eYUQgx@rjwysocki.net/

that I've just posted.

> How about something like the following? I have tested this on AMD
> servers by disabling the IO based C2 state, and I could observe that
> the offlined CPUs entered HLT bypassing the MWAIT based C1. When IO
> based C2 state is enabled, offlined CPUs enter the C2 state as before.
> If Global C-States are disabled, then, offlined CPUs enter HLT.
>
> This is on top of Patrick's series.
>
> I have defined X86_FEATURE_NO_MWAIT_OFFLINE as suggested by Dave to
> prevent MWAIT based C-states being used for CPU Offline on AMD and
> Hygon.
>
> ---------------------x8----------------------------------------------------
> From: "Gautham R. Shenoy" <gautham.shenoy@amd.com>
> Date: Thu, 14 Nov 2024 14:48:17 +0530
> Subject: [PATCH] acpi_idle: Teach acpi_idle_play_dead about FFH states
>
> Signed-off-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
> ---
>  arch/x86/include/asm/cpufeatures.h |  1 +
>  arch/x86/kernel/acpi/cstate.c      | 14 ++++++++++++++
>  arch/x86/kernel/smpboot.c          |  6 +++---
>  drivers/acpi/processor_idle.c      | 24 ++++++++++++++++++++++--
>  include/acpi/processor.h           |  6 ++++++
>  5 files changed, 46 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index 05e985ce9880..ceb002406d0c 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -236,6 +236,7 @@
>  #define X86_FEATURE_PVUNLOCK           ( 8*32+20) /* PV unlock function */
>  #define X86_FEATURE_VCPUPREEMPT                ( 8*32+21) /* PV vcpu_is_preempted function */
>  #define X86_FEATURE_TDX_GUEST          ( 8*32+22) /* "tdx_guest" Intel Trust Domain Extensions Guest */
> +#define X86_FEATURE_NO_MWAIT_OFFLINE    ( 8*32+23) /* Don't use MWAIT states for offlined CPUs*/
>
>  /* Intel-defined CPU features, CPUID level 0x00000007:0 (EBX), word 9 */
>  #define X86_FEATURE_FSGSBASE           ( 9*32+ 0) /* "fsgsbase" RDFSBASE, WRFSBASE, RDGSBASE, WRGSBASE instructions*/
> diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
> index f3ffd0a3a012..a5255a603745 100644
> --- a/arch/x86/kernel/acpi/cstate.c
> +++ b/arch/x86/kernel/acpi/cstate.c
> @@ -215,6 +215,16 @@ void __cpuidle acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
>  }
>  EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_enter);
>
> +void acpi_processor_ffh_play_dead(struct acpi_processor_cx *cx)
> +{
> +       unsigned int cpu = smp_processor_id();
> +       struct cstate_entry *percpu_entry;
> +
> +       percpu_entry = per_cpu_ptr(cpu_cstate_entry, cpu);
> +       mwait_play_dead_with_hint(percpu_entry->states[cx->index].eax);
> +}
> +EXPORT_SYMBOL_GPL(acpi_processor_ffh_play_dead);

Why does it need to be exported to modules?

> +
>  static int __init ffh_cstate_init(void)
>  {
>         struct cpuinfo_x86 *c = &boot_cpu_data;
> @@ -224,6 +234,10 @@ static int __init ffh_cstate_init(void)
>             c->x86_vendor != X86_VENDOR_HYGON)
>                 return -1;
>
> +       if (c->x86_vendor == X86_VENDOR_AMD ||
> +           c->x86_vendor == X86_VENDOR_HYGON)
> +               set_cpu_cap(c, X86_FEATURE_NO_MWAIT_OFFLINE);
> +
>         cpu_cstate_entry = alloc_percpu(struct cstate_entry);
>         return 0;
>  }
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 106f26e976b8..b5939bf96be6 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1285,9 +1285,9 @@ static inline int mwait_play_dead(void)
>         unsigned int highest_subcstate = 0;
>         int i;
>
> -       if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
> -           boot_cpu_data.x86_vendor == X86_VENDOR_HYGON)
> -               return 1;
> +       if (boot_cpu_has(X86_FEATURE_NO_MWAIT_OFFLINE))
> +               return -ENODEV;
> +
>         if (!this_cpu_has(X86_FEATURE_MWAIT))
>                 return 1;
>         if (!this_cpu_has(X86_FEATURE_CLFLUSH))
> diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
> index 831fa4a12159..1e2259b4395b 100644
> --- a/drivers/acpi/processor_idle.c
> +++ b/drivers/acpi/processor_idle.c
> @@ -590,6 +590,8 @@ static int acpi_idle_play_dead(struct cpuidle_device *dev, int index)
>                         raw_safe_halt();
>                 else if (cx->entry_method == ACPI_CSTATE_SYSTEMIO) {
>                         io_idle(cx->address);
> +               } else if (cx->entry_method == ACPI_CSTATE_FFH) {
> +                       acpi_processor_ffh_play_dead(cx);
>                 } else
>                         return -ENODEV;
>         }
> @@ -772,6 +774,25 @@ static int acpi_processor_setup_cpuidle_cx(struct acpi_processor *pr,
>         return 0;
>  }
>
> +static inline bool valid_acpi_idle_type(struct acpi_processor_cx *cx)
> +{
> +       if (cx->type == ACPI_STATE_C1 ||
> +           cx->type == ACPI_STATE_C2 ||
> +           cx->type == ACPI_STATE_C3)
> +               return true;
> +
> +       return false;

This could do

    return cx->type == ACPI_STATE_C1 || cx->type == ACPI_STATE_C2 ||
cx->type == ACPI_STATE_C3;

but it may just not be necessary.

Please see

https://lore.kernel.org/linux-pm/2373563.ElGaqSPkdT@rjwysocki.net/

> +}
> +
> +static inline bool acpi_idle_can_play_dead(struct acpi_processor_cx *cx)
> +{
> +       if (cx->entry_method == ACPI_CSTATE_FFH &&
> +           boot_cpu_has(X86_FEATURE_NO_MWAIT_OFFLINE))
> +               return false;
> +
> +       return true;
> +}
> +
>  static int acpi_processor_setup_cstates(struct acpi_processor *pr)
>  {
>         int i, count;
> @@ -803,8 +824,7 @@ static int acpi_processor_setup_cstates(struct acpi_processor *pr)
>                 state->enter = acpi_idle_enter;
>
>                 state->flags = 0;
> -               if (cx->type == ACPI_STATE_C1 || cx->type == ACPI_STATE_C2 ||
> -                   cx->type == ACPI_STATE_C3) {
> +               if (valid_acpi_idle_type(cx) && acpi_idle_can_play_dead(cx)) {
>                         state->enter_dead = acpi_idle_play_dead;

The check below is separate from the "enter_dead" check, so it should
not depend on acpi_idle_can_play_dead(cx).

>                         if (cx->type != ACPI_STATE_C3)
>                                 drv->safe_state_index = count;
> diff --git a/include/acpi/processor.h b/include/acpi/processor.h
> index e6f6074eadbf..349fe47a579e 100644
> --- a/include/acpi/processor.h
> +++ b/include/acpi/processor.h
> @@ -280,6 +280,7 @@ int acpi_processor_ffh_cstate_probe(unsigned int cpu,
>                                     struct acpi_processor_cx *cx,
>                                     struct acpi_power_register *reg);
>  void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cstate);
> +void acpi_processor_ffh_play_dead(struct acpi_processor_cx *cstate);
>  #else
>  static inline void acpi_processor_power_init_bm_check(struct
>                                                       acpi_processor_flags
> @@ -300,6 +301,11 @@ static inline void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx
>  {
>         return;
>  }
> +
> +static inline void acpi_processor_ffh_play_dead(struct acpi_processor_cx *cstate)
> +{
> +       return;
> +}
>  #endif
>
>  static inline int call_on_cpu(int cpu, long (*fn)(void *), void *arg,
>
>
> --
> Thanks and Regards
> gautham.
>
Thomas Gleixner Nov. 15, 2024, 1:21 a.m. UTC | #13
On Thu, Nov 14 2024 at 13:03, Peter Zijlstra wrote:
> On Tue, Nov 12, 2024 at 03:01:27PM +0100, Peter Zijlstra wrote:
>
>> No, not mwait hint. We need an instruction that:
>> 
>>  - goes to deepest C state
>>  - drops into WAIT-for-Start-IPI (SIPI)
>> 
>> Notably, it should not wake from:
>> 
>>  - random memory writes
>>  - NMI, MCE, SMI and other such non-maskable thingies
>>  - anything else -- the memory pointed to by RIP might no longer exist
>> 
>> Lets call the instruction: DEAD.
>
> So, turns out that when you send INIT to an AP it does the whole drop
> into Wait-for-SIPI and ignore non-maskable crap.
>
> The reason we don't do that is because INIT to CPU0 (BP) is somewhat
> fatal, but since Thomas killed all that CPU0 hotplug crap, I think we
> can actually go do that.

Instead of playing dead or to kick out CPUs from whatever dead play
routine they are in?

playimg dead is to stay because INIT will bring back the MCE broadcast
problem, which we try to avoid by bringing SMT siblings up just to shut
them down again by playing dead.

You need a MCE broadcast free system and/or some sensible BIOS bringup
code for that to work...

Thanks,

        tglx
Peter Zijlstra Nov. 15, 2024, 10:07 a.m. UTC | #14
On Fri, Nov 15, 2024 at 02:21:19AM +0100, Thomas Gleixner wrote:
> On Thu, Nov 14 2024 at 13:03, Peter Zijlstra wrote:
> > On Tue, Nov 12, 2024 at 03:01:27PM +0100, Peter Zijlstra wrote:
> >
> >> No, not mwait hint. We need an instruction that:
> >> 
> >>  - goes to deepest C state
> >>  - drops into WAIT-for-Start-IPI (SIPI)
> >> 
> >> Notably, it should not wake from:
> >> 
> >>  - random memory writes
> >>  - NMI, MCE, SMI and other such non-maskable thingies
> >>  - anything else -- the memory pointed to by RIP might no longer exist
> >> 
> >> Lets call the instruction: DEAD.
> >
> > So, turns out that when you send INIT to an AP it does the whole drop
> > into Wait-for-SIPI and ignore non-maskable crap.
> >
> > The reason we don't do that is because INIT to CPU0 (BP) is somewhat
> > fatal, but since Thomas killed all that CPU0 hotplug crap, I think we
> > can actually go do that.
> 
> Instead of playing dead or to kick out CPUs from whatever dead play
> routine they are in?
> 
> playimg dead is to stay because INIT will bring back the MCE broadcast
> problem, which we try to avoid by bringing SMT siblings up just to shut
> them down again by playing dead.
> 
> You need a MCE broadcast free system and/or some sensible BIOS bringup
> code for that to work...

Isn't INIT a better state to be in during kexec than HLT?
Peter Zijlstra Nov. 15, 2024, 10:11 a.m. UTC | #15
On Thu, Nov 14, 2024 at 10:54:15PM +0530, Gautham R. Shenoy wrote:
> Hello Peter,
> 
> On Thu, Nov 14, 2024 at 12:58:31PM +0100, Peter Zijlstra wrote:
> > On Wed, Nov 13, 2024 at 05:11:38PM +0530, Gautham R. Shenoy wrote:
> > 
> > > AMD platforms won't be using FFH based states for offlined CPUs. We
> > > prefer IO based states when available, and HLT otherwise.
> > > 
> > > > 
> > > > Robustly we'd teach the ACPI driver about FFh and set enter_dead on
> > > > every state -- but we'd have to double check that with AMD.
> > > 
> > > Works for us as long as those FFh states aren't used for play_dead on
> > > AMD platforms.
> > 
> > AFAIU AMD doesn't want to use MWAIT -- ever, not only for offline.
> > Confirm?
> > 
> 
> AMD wants to use MWAIT for cpuidle and it does use MWAIT based C1
> state on both client and server parts.
> 
> Eg: On my server box
> 
> $ cpupower idle-info  | grep "FFH" -B1 -A3
> C1:
> Flags/Description: ACPI FFH MWAIT 0x0
> Latency: 1
> Usage: 6591
> Duration: 1482606
> 
> > But if it were to use MWAIT for regular idle, then surely it's OK for
> > offline too, right?
> 
> I tried this out today and there is no functional issue.
> 
> However, I would like to run some experiments on whether HLT provides
> better power savings than MWAIT C1 with CPUs offlined. I will get back
> with this information tomorrow.

Right, but in most cases you'll have C2/C3 with io ports specified and
those will be picked for play_dead anyway. It's just the exceptionally
weird BIOS case where you'll have C2/C3 as FFh -- because random BIOS
person was on drugs that day or something like that.

Anyway, what I'm trying to say is that you'll probably fine without
adding a bunch of if (AMD|HYGON) logic -- the less of that we have, the
better, etc..
Thomas Gleixner Nov. 15, 2024, 3:37 p.m. UTC | #16
On Fri, Nov 15 2024 at 11:07, Peter Zijlstra wrote:
> On Fri, Nov 15, 2024 at 02:21:19AM +0100, Thomas Gleixner wrote:
>> On Thu, Nov 14 2024 at 13:03, Peter Zijlstra wrote:
>> > On Tue, Nov 12, 2024 at 03:01:27PM +0100, Peter Zijlstra wrote:
>> >
>> >> No, not mwait hint. We need an instruction that:
>> >> 
>> >>  - goes to deepest C state
>> >>  - drops into WAIT-for-Start-IPI (SIPI)
>> >> 
>> >> Notably, it should not wake from:
>> >> 
>> >>  - random memory writes
>> >>  - NMI, MCE, SMI and other such non-maskable thingies
>> >>  - anything else -- the memory pointed to by RIP might no longer exist
>> >> 
>> >> Lets call the instruction: DEAD.
>> >
>> > So, turns out that when you send INIT to an AP it does the whole drop
>> > into Wait-for-SIPI and ignore non-maskable crap.
>> >
>> > The reason we don't do that is because INIT to CPU0 (BP) is somewhat
>> > fatal, but since Thomas killed all that CPU0 hotplug crap, I think we
>> > can actually go do that.
>> 
>> Instead of playing dead or to kick out CPUs from whatever dead play
>> routine they are in?
>> 
>> playimg dead is to stay because INIT will bring back the MCE broadcast
>> problem, which we try to avoid by bringing SMT siblings up just to shut
>> them down again by playing dead.
>> 
>> You need a MCE broadcast free system and/or some sensible BIOS bringup
>> code for that to work...
>
> Isn't INIT a better state to be in during kexec than HLT?

For the kexec transition I agree that INIT is better. We just need to be
careful for a crash kexec when the crashing CPU is not CPU0...

Thanks,

        tglx
diff mbox series

Patch

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 44c40781bad6..721bb931181c 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1416,9 +1416,9 @@  void native_play_dead(void)
 	play_dead_common();
 	tboot_shutdown(TB_SHUTDOWN_WFS);
 
-	mwait_play_dead();
 	if (cpuidle_play_dead())
-		hlt_play_dead();
+		mwait_play_dead();
+	hlt_play_dead();
 }
 
 #else /* ... !CONFIG_HOTPLUG_CPU */