Message ID: 3344336.aeNJFYEL58@rjwysocki.net
Series: cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT
On 4/16/25 18:48, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> Add a helper for checking if schedutil is the current governor for
> a given cpufreq policy and use it in sched_is_eas_possible() to avoid
> accessing cpufreq policy internals directly from there.
>
> No intentional functional impact.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Reviewed-by: Christian Loehle <christian.loehle@arm.com>
On 4/16/25 19:10, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> On some hybrid platforms a group of cores (referred to as a module) may
> share an L2 cache in which case they also share a voltage regulator and
> always run at the same frequency (while not in idle states).
>
> For this reason, make hybrid_register_perf_domain() in the intel_pstate
> driver add all CPUs sharing an L2 cache to the same perf domain for EAS.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>
> New in v1.
>
> ---
>  drivers/cpufreq/intel_pstate.c | 23 +++++++++++++++++++++--
>  1 file changed, 21 insertions(+), 2 deletions(-)
>
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -999,8 +999,11 @@
>  {
>  	static const struct em_data_callback cb
>  			= EM_ADV_DATA_CB(hybrid_active_power, hybrid_get_cost);
> +	struct cpu_cacheinfo *cacheinfo = get_cpu_cacheinfo(cpu);
> +	const struct cpumask *cpumask = cpumask_of(cpu);
>  	struct cpudata *cpudata = all_cpu_data[cpu];
>  	struct device *cpu_dev;
> +	int ret;
>
>  	/*
>  	 * Registering EM perf domains without enabling asymmetric CPU capacity
> @@ -1014,9 +1017,25 @@
>  	if (!cpu_dev)
>  		return false;
>
> -	if (em_dev_register_perf_domain(cpu_dev, HYBRID_EM_STATE_COUNT, &cb,
> -					cpumask_of(cpu), false))
> +	if (cacheinfo) {
> +		unsigned int i;
> +
> +		/* Find the L2 cache and the CPUs sharing it. */
> +		for (i = 0; i < cacheinfo->num_leaves; i++) {
> +			if (cacheinfo->info_list[i].level == 2) {
> +				cpumask = &cacheinfo->info_list[i].shared_cpu_map;
> +				break;
> +			}
> +		}
> +	}
> +
> +	ret = em_dev_register_perf_domain(cpu_dev, HYBRID_EM_STATE_COUNT, &cb,
> +					  cpumask, false);
> +	if (ret) {
> +		cpudata->em_registered = ret == -EEXIST;
> +
>  		return false;
> +	}
>
>  	cpudata->em_registered = true;

debugfs already provides a way to retrieve that information, but with more
complex perf domain constructions like here maybe this would be useful
(maybe it already is):

--->8---
Subject: [PATCH] PM: EM: Print CPUs of perf domains

In preparation for future EAS users who make the relation from CPU to
perf-domain not strictly based on cpufreq policies print the affected
CPUs when registering a perf-domain.

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
---
 kernel/power/energy_model.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 99a1ae324c2d..a202968b2ee9 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -627,7 +627,7 @@ int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
 	em_cpufreq_update_efficiencies(dev, em_table->state);
 
 	em_debug_create_pd(dev);
-	dev_info(dev, "EM: created perf domain\n");
+	dev_info(dev, "EM: created perf domain for CPUs %*pbl\n", cpumask_pr_args(cpus));
 
 unlock:
 	mutex_unlock(&em_pd_mutex);
On 4/17/25 14:01, Rafael J. Wysocki wrote:
> On Thu, Apr 17, 2025 at 2:19 PM Christian Loehle
> <christian.loehle@arm.com> wrote:
>>
>> On 4/16/25 19:01, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>
>>> Some cpufreq drivers, like intel_pstate, have built-in governors that
>>> are used instead of regular cpufreq governors, schedutil in particular,
>>> but they can work with EAS just fine, so allow EAS to be used with
>>> those drivers.
>>>
>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>> ---
>>>
>>> v0.3 -> v1
>>>  * Rebase on top of the new [1-2/8].
>>>  * Update the diagnostic message printed if the conditions are not met.
>>>
>>> This patch is regarded as a cleanup for 6.16.
>>>
>>> ---
>>>  drivers/cpufreq/cpufreq.c | 13 +++++++++++--
>>>  1 file changed, 11 insertions(+), 2 deletions(-)
>>>
>>> --- a/drivers/cpufreq/cpufreq.c
>>> +++ b/drivers/cpufreq/cpufreq.c
>>> @@ -3054,7 +3054,16 @@
>>>
>>>  	guard(cpufreq_policy_read)(policy);
>>>
>>> -	return sugov_is_governor(policy);
>>> +	/*
>>> +	 * For EAS compatibility, require that either schedutil is the policy
>>> +	 * governor or the policy is governed directly by the cpufreq driver.
>>> +	 *
>>> +	 * In the latter case, it is assumed that EAS can only be enabled by the
>>> +	 * cpufreq driver itself which will not enable EAS if it does not meet
>>> +	 * the EAS' expectations regarding performance scaling response.
>>> +	 */
>>> +	return sugov_is_governor(policy) || (!policy->governor &&
>>> +	       policy->policy != CPUFREQ_POLICY_UNKNOWN);
>>>  }
>>>
>>>  bool cpufreq_ready_for_eas(const struct cpumask *cpu_mask)
>>> @@ -3064,7 +3073,7 @@
>>>
>>>  	/* Do not attempt EAS if schedutil is not being used. */
>>>  	for_each_cpu(cpu, cpu_mask) {
>>>  		if (!cpufreq_policy_is_good_for_eas(cpu)) {
>>> -			pr_debug("rd %*pbl: schedutil is mandatory for EAS\n",
>>> +			pr_debug("rd %*pbl: EAS requirements not met\n",
>>>  				 cpumask_pr_args(cpu_mask));
>>
>> I'd prefer to have at least "EAS cpufreq requirements" printed here.
>
> Sure.
>
>> with that caveat
>> Reviewed-by: Christian Loehle <christian.loehle@arm.com>
>>
>> Maybe we should amend the EAS documentation to reflect this?
>
> Yes, the documentation should be updated.  Which piece of it in
> particular I need to look at?

Documentation/scheduler/sched-energy.rst has:

  6.4 - Schedutil governor

so at least there.
On Fri, Apr 18, 2025 at 11:58 AM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 4/16/25 18:44, Rafael J. Wysocki wrote:
> > Hi Everyone,
> >
> > This is a new version of
> >
> > https://lore.kernel.org/linux-pm/22640172.EfDdHjke4D@rjwysocki.net/
> >
> > which is not regarded as RFC any more.  It appears to be better than
> >
> > https://lore.kernel.org/linux-pm/5861970.DvuYhMxLoT@rjwysocki.net/
> >
> > but still requires more testing, so I'd appreciate any help here.
> >
> > The following paragraph from the original cover letter still applies:
> >
> > "The underlying observation is that on the platforms targeted by these changes,
> > Lunar Lake at the time of this writing, the "small" CPUs (E-cores), when run at
> > the same performance level, are always more energy-efficient than the "big" or
> > "performance" CPUs (P-cores).  This means that, regardless of the scale-
> > invariant utilization of a task, as long as there is enough spare capacity on
> > E-cores, the relative cost of running it there is always lower."
> >
> > The first 3 patches have been updated since v0.3 and they now depend on the new
> > cpufreq material in linux-next.
> >
> > The next 2 patches (Energy Model code changes) have been reviewed in the
> > meantime, but they are only needed for the last 3 patches.
> >
> > Patch [6/8] is essentially the same as before.  It causes perf domains to be
> > registered per CPU and in addition to the primary cost component, which is
> > related to the CPU type, there is a small component proportional to performance
> > whose role is to help balance the load between CPUs of the same type.
> >
> > This is done to avoid migrating tasks too much between CPUs of the same type,
> > especially between E-cores, which has been observed in tests of
> >
> > https://lore.kernel.org/linux-pm/5861970.DvuYhMxLoT@rjwysocki.net/
> >
> > The expected effect is still that the CPUs of the "low-cost" type will be
> > preferred so long as there is enough spare capacity on any of them.
> >
> > The last 2 patches are new.
> >
> > Patch [7/8] looks at the cache topology to avoid creating per-CPU perf domains
> > for CPUs sharing an L2 cache.  Typically, on the chips that will be affected
> > by this patch, CPUs sharing an L2 cache also share a voltage regulator and a
> > clock, so they technically belong to the same OPP domain and they will be put
> > into a shared perf domain after this patch.
> >
> > Patch [8/8] makes CPUs sharing the L3 cache look slightly more expensive to
> > cause the scheduler to prefer placing tasks on CPUs that don't use the L3,
> > which in some cases should allow all of the CPUs sharing the L3 to stay in
> > idle states and the energy usage should be reduced.
> >
> > Please refer to the individual patch changelogs for details.
> >
> > Since patches [7-8/8] also apply on top of the v0.3, I have created a git branch at
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \
> > experimental/intel_pstate/eas-take2-extended
> >
> > or
> >
> > https://web.git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/log/?h=experimental/intel_pstate/eas-take2-extended
> >
> > to allow the difference they make with respect to the v0.3 to be seen (if any).
> >
> > Later, I'm going to put this series as a whole into a new git branch on top of
> > the mainline and the cpufreq material queued up for 6.16.
> >
> > Thanks!
> >

> Similar to the v0.3 tests done here:
> https://lore.kernel.org/linux-pm/6ab0531a-d6d8-46ac-9afc-23cf87f37905@arm.com/
> here are the results for the same raptor lake nosmt machine (now with
> 4 e-cores + 4 e-cores and 8x1 p-core PDs, 10 PDs in total).
>
> Firefox YouTube 4K video playback:
> EAS:
> 684.504 +-19.167841239372198
> CAS:
> 929.83 +-50.41498564690636
> (-26.3844% energy used with EAS)
> (cf. -43.1% energy used with EAS v0.3)
> (cf. -24.2% energy used with EAS v0.2)
>
> Firefox Web Aquarium 500 fish:
> EAS:
> 540.192 +-14.294833410089904
> CAS:
> 712.896 +-16.821304745272684
> (-24.2257% energy used with EAS)
> (cf. -35.6% energy used with EAS v0.3)
>
> Seems the per-CPU PD worked better, at least for this machine, which arguably
> isn't the main target of the series.

But it is an interesting data point.  We still cannot get this much of a
difference on other systems in our labs, even with the same type of
workload.

Thanks for the results, much appreciated!

> Feel free to add
> Tested-by: Christian Loehle <christian.loehle@arm.com>
> to patches 1 to 7 (the tested system isn't affected by 8/8).

Thank you!