mbox series

[RFT,v1,0/8] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT

Message ID 3344336.aeNJFYEL58@rjwysocki.net
Headers show
Series cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT | expand

Message

Rafael J. Wysocki April 16, 2025, 5:44 p.m. UTC
Hi Everyone,

This is a new version of

https://lore.kernel.org/linux-pm/22640172.EfDdHjke4D@rjwysocki.net/

which is not regarded as RFC any more.  It appears to be better than

https://lore.kernel.org/linux-pm/5861970.DvuYhMxLoT@rjwysocki.net/

but still requires more testing, so I'd appreciate any help here.

The following paragraph from the original cover letter still applies:

"The underlying observation is that on the platforms targeted by these changes,
Lunar Lake at the time of this writing, the "small" CPUs (E-cores), when run at
the same performance level, are always more energy-efficient than the "big" or
"performance" CPUs (P-cores).  This means that, regardless of the scale-
invariant utilization of a task, as long as there is enough spare capacity on
E-cores, the relative cost of running it there is always lower."

The first 3 patches have been updated since v0.3 and they now depend on the new
cpufreq material in linux-next.

The next 2 patches (Energy Model code changes) have been reviewed in the
meantime, but they are only needed for the last 3 patches.

Patch [6/8] is essentially the same as before.  It causes perf domains to be
registered per CPU and in addition to the primary cost component, which is
related to the CPU type, there is a small component proportional to performance
whose role is to help balance the load between CPUs of the same type.

This is done to avoid migrating tasks too much between CPUs of the same type,
especially between E-cores, which has been observed in tests of

https://lore.kernel.org/linux-pm/5861970.DvuYhMxLoT@rjwysocki.net/

The expected effect is still that the CPUs of the "low-cost" type will be
preferred so long as there is enough spare capacity on any of them.

The last 2 patches are new.

Patch [7/8] looks at the cache topology to avoid creating per-CPU perf domains
for CPUs sharing an L2 cache.  Typically, on the chips that will be affected
by this patch, CPUs sharing an L2 cache also share a voltage regulator and a
clock, so they technically belong to the same OPP domain and they will be put
into a shared perf domain after this patch.

Patch [8/8] makes CPUs sharing the L3 cache look slightly more expensive to
cause the scheduler to prefer placing tasks on CPUs that don't use the L3,
which in some cases should allow all of the CPUs sharing the L3 to stay in
idle states and the energy usage should be reduced.

Please refer to the individual patch changelogs for details.

Since patches [7-8/8] also apply on top of the v0.3, I have created a git branch at

git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \
experimental/intel_pstate/eas-take2-extended

or

https://web.git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/log/?h=experimental/intel_pstate/eas-take2-extended

to allow the difference they make with respect to the v0.3 to be seen (if any).

Later, I'm going to put this series as a whole into a new git branch on top of
the mainline and the cpufreq material queued up for 6.16.

Thanks!

Comments

Christian Loehle April 17, 2025, 12:23 p.m. UTC | #1
On 4/16/25 18:48, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Add a helper for checking if schedutil is the current governor for
> a given cpufreq policy and use it in sched_is_eas_possible() to avoid
> accessing cpufreq policy internals directly from there.
> 
> No intentional functional impact.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Christian Loehle April 17, 2025, 12:42 p.m. UTC | #2
On 4/16/25 19:10, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> On some hybrid platforms a group of cores (referred to as a module) may
> share an L2 cache in which case they also share a voltage regulator and
> always run at the same frequency (while not in idle states).
> 
> For this reason, make hybrid_register_perf_domain() in the intel_pstate
> driver add all CPUs sharing an L2 cache to the same perf domain for EAS.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
> 
> New in v1.
> 
> ---
>  drivers/cpufreq/intel_pstate.c |   23 +++++++++++++++++++++--
>  1 file changed, 21 insertions(+), 2 deletions(-)
> 
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -999,8 +999,11 @@
>  {
>  	static const struct em_data_callback cb
>  			= EM_ADV_DATA_CB(hybrid_active_power, hybrid_get_cost);
> +	struct cpu_cacheinfo *cacheinfo = get_cpu_cacheinfo(cpu);
> +	const struct cpumask *cpumask = cpumask_of(cpu);
>  	struct cpudata *cpudata = all_cpu_data[cpu];
>  	struct device *cpu_dev;
> +	int ret;
>  
>  	/*
>  	 * Registering EM perf domains without enabling asymmetric CPU capacity
> @@ -1014,9 +1017,25 @@
>  	if (!cpu_dev)
>  		return false;
>  
> -	if (em_dev_register_perf_domain(cpu_dev, HYBRID_EM_STATE_COUNT, &cb,
> -					cpumask_of(cpu), false))
> +	if (cacheinfo) {
> +		unsigned int i;
> +
> +		/* Find the L2 cache and the CPUs sharing it. */
> +		for (i = 0; i < cacheinfo->num_leaves; i++) {
> +			if (cacheinfo->info_list[i].level == 2) {
> +				cpumask = &cacheinfo->info_list[i].shared_cpu_map;
> +				break;
> +			}
> +		}
> +	}
> +
> +	ret = em_dev_register_perf_domain(cpu_dev, HYBRID_EM_STATE_COUNT, &cb,
> +					  cpumask, false);
> +	if (ret) {
> +		cpudata->em_registered = ret == -EEXIST;
> +
>  		return false;
> +	}
>  
>  	cpudata->em_registered = true;
>  
> 

debugfs already provides a way to retrieve that information, but with more
complex perf domain constructions like here maybe this would be useful
(maybe it already is):

--->8---

Subject: [PATCH] PM: EM: Print CPUs of perf domains

In preparation for future EAS users who make the relation from CPU
to perf-domain not strictly based on cpufreq policies print the
affected CPUs when registering a perf-domain.

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
---
 kernel/power/energy_model.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 99a1ae324c2d..a202968b2ee9 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -627,7 +627,7 @@ int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
 	em_cpufreq_update_efficiencies(dev, em_table->state);
 
 	em_debug_create_pd(dev);
-	dev_info(dev, "EM: created perf domain\n");
+	dev_info(dev, "EM: created perf domain for CPUs %*pbl\n", cpumask_pr_args(cpus));
 
 unlock:
 	mutex_unlock(&em_pd_mutex);
Christian Loehle April 17, 2025, 1:03 p.m. UTC | #3
On 4/17/25 14:01, Rafael J. Wysocki wrote:
> On Thu, Apr 17, 2025 at 2:19 PM Christian Loehle
> <christian.loehle@arm.com> wrote:
>>
>> On 4/16/25 19:01, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>
>>> Some cpufreq drivers, like intel_pstate, have built-in governors that
>>> are used instead of regular cpufreq governors, schedutil in particular,
>>> but they can work with EAS just fine, so allow EAS to be used with
>>> those drivers.
>>>
>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>> ---
>>>
>>> v0.3 -> v1
>>>      * Rebase on top of the new [1-2/8].
>>>      * Update the diagnostic message printed if the conditions are not met.
>>>
>>> This patch is regarded as a cleanup for 6.16.
>>>
>>> ---
>>>  drivers/cpufreq/cpufreq.c |   13 +++++++++++--
>>>  1 file changed, 11 insertions(+), 2 deletions(-)
>>>
>>> --- a/drivers/cpufreq/cpufreq.c
>>> +++ b/drivers/cpufreq/cpufreq.c
>>> @@ -3054,7 +3054,16 @@
>>>
>>>       guard(cpufreq_policy_read)(policy);
>>>
>>> -     return sugov_is_governor(policy);
>>> +     /*
>>> +      * For EAS compatibility, require that either schedutil is the policy
>>> +      * governor or the policy is governed directly by the cpufreq driver.
>>> +      *
>>> +      * In the latter case, it is assumed that EAS can only be enabled by the
>>> +      * cpufreq driver itself which will not enable EAS if it does not meet
>>> +      * the EAS' expectations regarding performance scaling response.
>>> +      */
>>> +     return sugov_is_governor(policy) || (!policy->governor &&
>>> +             policy->policy != CPUFREQ_POLICY_UNKNOWN);
>>>  }
>>>
>>>  bool cpufreq_ready_for_eas(const struct cpumask *cpu_mask)
>>> @@ -3064,7 +3073,7 @@
>>>       /* Do not attempt EAS if schedutil is not being used. */
>>>       for_each_cpu(cpu, cpu_mask) {
>>>               if (!cpufreq_policy_is_good_for_eas(cpu)) {
>>> -                     pr_debug("rd %*pbl: schedutil is mandatory for EAS\n",
>>> +                     pr_debug("rd %*pbl: EAS requirements not met\n",
>>>                                cpumask_pr_args(cpu_mask));
>>
>> I'd prefer to have at least "EAS cpufreq requirements" printed here.
> 
> Sure.
> 
>> with that caveat
>> Reviewed-by: Christian Loehle <christian.loehle@arm.com>
>>
>> Maybe we should amend the EAS documentation to reflect this?
> 
> Yes, the documentation should be updated.  Which piece of it in
> particular I need to look at?

Documentation/scheduler/sched-energy.rst
has:
6.4 - Schedutil governor
so at least there.
Rafael J. Wysocki April 18, 2025, 7:52 p.m. UTC | #4
On Fri, Apr 18, 2025 at 11:58 AM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 4/16/25 18:44, Rafael J. Wysocki wrote:
> > Hi Everyone,
> >
> > This is a new version of
> >
> > https://lore.kernel.org/linux-pm/22640172.EfDdHjke4D@rjwysocki.net/
> >
> > which is not regarded as RFC any more.  It appears to be better than
> >
> > https://lore.kernel.org/linux-pm/5861970.DvuYhMxLoT@rjwysocki.net/
> >
> > but still requires more testing, so I'd appreciate any help here.
> >
> > The following paragraph from the original cover letter still applies:
> >
> > "The underlying observation is that on the platforms targeted by these changes,
> > Lunar Lake at the time of this writing, the "small" CPUs (E-cores), when run at
> > the same performance level, are always more energy-efficient than the "big" or
> > "performance" CPUs (P-cores).  This means that, regardless of the scale-
> > invariant utilization of a task, as long as there is enough spare capacity on
> > E-cores, the relative cost of running it there is always lower."
> >
> > The first 3 patches have been updated since v0.3 and they now depend on the new
> > cpufreq material in linux-next.
> >
> > The next 2 patches (Energy Model code changes) have been reviewed in the
> > meantime, but they are only needed for the last 3 patches.
> >
> > Patch [6/8] is essentially the same as before.  It causes perf domains to be
> > registered per CPU and in addition to the primary cost component, which is
> > related to the CPU type, there is a small component proportional to performance
> > whose role is to help balance the load between CPUs of the same type.
> >
> > This is done to avoid migrating tasks too much between CPUs of the same type,
> > especially between E-cores, which has been observed in tests of
> >
> > https://lore.kernel.org/linux-pm/5861970.DvuYhMxLoT@rjwysocki.net/
> >
> > The expected effect is still that the CPUs of the "low-cost" type will be
> > preferred so long as there is enough spare capacity on any of them.
> >
> > The last 2 patches are new.
> >
> > Patch [7/8] looks at the cache topology to avoid creating per-CPU perf domains
> > for CPUs sharing an L2 cache.  Typically, on the chips that will be affected
> > by this patch, CPUs sharing an L2 cache also share a voltage regulator and a
> > clock, so they technically belong to the same OPP domain and they will be put
> > into a shared perf domain after this patch.
> >
> > Patch [8/8] makes CPUs sharing the L3 cache look slightly more expensive to
> > cause the scheduler to prefer placing tasks on CPUs that don't use the L3,
> > which in some cases should allow all of the CPUs sharing the L3 to stay in
> > idle states and the energy usage should be reduced.
> >
> > Please refer to the individual patch changelogs for details.
> >
> > Since patches [7-8/8] also apply on top of the v0.3, I have created a git branch at
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \
> > experimental/intel_pstate/eas-take2-extended
> >
> > or
> >
> > https://web.git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/log/?h=experimental/intel_pstate/eas-take2-extended
> >
> > to allow the difference they make with respect to the v0.3 to be seen (if any).
> >
> > Later, I'm going to put this series as a whole into a new git branch on top of
> > the mainline and the cpufreq material queued up for 6.16.
> >
> > Thanks!
> >
>
> Similar to the v0.3 tests done here:
> https://lore.kernel.org/linux-pm/6ab0531a-d6d8-46ac-9afc-23cf87f37905@arm.com/
> here are the results for the same raptor lake nosmt machine (now with
> 4 e-cores + 4 e-cores and 8x1 p-core PDs, 10 PDs in total).
>
> Firefox YouTube 4K video playback:
> EAS:
> 684.504 +-19.167841239372198
> CAS:
> 929.83 +-50.41498564690636
> (-26.3844% energy used with EAS)
> (cf. -43.1% energy used with EAS v0.3)
> (cf. -24.2% energy used with EAS v0.2)
>
> Firefox Web Aquarium 500 fish.
> EAS:
> 540.192 +-14.294833410089904
> CAS:
> 712.896 +-16.821304745272684
> (-24.2257% energy used with EAS)
> (cf. -35.6% energy used with EAS v0.3)
>
> Seems the per-CPU PD worked better, at least for this machine, which arguably
> isn't the main target of the series.

But it is an interesting data point.

We still cannot get this much of a difference on other systems in our
labs, even with the same type of workload.

Thanks for the results, much appreciated!

> Feel free to add
> Tested-by: Christian Loehle <christian.loehle@arm.com>
> to patches 1 to 7 (the tested system isn't affected by 8/8).

Thank you!