Message ID: 20240108134843.429769-1-vincent.guittot@linaro.org
Series: Rework system pressure interface to the scheduler
On Mon, 8 Jan 2024 at 17:35, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 08/01/2024 14:48, Vincent Guittot wrote:
> > Provide to the scheduler a feedback about the temporary max available
> > capacity. Unlike arch_update_thermal_pressure, this doesn't need to be
> > filtered as the pressure will happen for dozens ms or more.
>
> Is this then related to the 'medium pace system pressure' you mentioned
> in your OSPM '23 talk?
>
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >  drivers/cpufreq/cpufreq.c | 36 ++++++++++++++++++++++++++++++++++++
> >  include/linux/cpufreq.h   | 10 ++++++++++
> >  2 files changed, 46 insertions(+)
> >
> > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> > index 44db4f59c4cc..fa2e2ea26f7f 100644
> > --- a/drivers/cpufreq/cpufreq.c
> > +++ b/drivers/cpufreq/cpufreq.c
> > @@ -2563,6 +2563,40 @@ int cpufreq_get_policy(struct cpufreq_policy *policy, unsigned int cpu)
> >  }
> >  EXPORT_SYMBOL(cpufreq_get_policy);
> >
> > +DEFINE_PER_CPU(unsigned long, cpufreq_pressure);
> > +
> > +/**
> > + * cpufreq_update_pressure() - Update cpufreq pressure for CPUs
> > + * @policy: cpufreq policy of the CPUs.
> > + *
> > + * Update the value of cpufreq pressure for all @cpus in the policy.
> > + */
> > +static void cpufreq_update_pressure(struct cpufreq_policy *policy)
> > +{
> > +        unsigned long max_capacity, capped_freq, pressure;
> > +        u32 max_freq;
> > +        int cpu;
> > +
> > +        /*
> > +         * Handle properly the boost frequencies, which should simply clean
> > +         * the thermal pressure value.
>                  ^^^^^^^
> IMHO, this is a copy & paste error from topology_update_thermal_pressure()?
>
> > +         */
> > +        if (max_freq <= capped_freq) {
>
> max_freq seems to be uninitialized.

argh yes, I made a mess while cleaning up:
both max_freq and capped_freq are uninitialized

>
> > +                pressure = 0;
>
> Is this x86 (turbo boost) specific? IMHO at arm we follow this max freq
> (including boost) relates to 1024 in capacity? Or haven't we made this
> discussion yet?

This is not x86 specific. We can have capped_freq > max_freq on Arm too.

Also, this bypasses all the calculation below when max_freq == capped_freq,
which is the most common case.

>
> > +        } else {
> > +                cpu = cpumask_first(policy->related_cpus);
> > +                max_capacity = arch_scale_cpu_capacity(cpu);
> > +                capped_freq = policy->max;
> > +                max_freq = arch_scale_freq_ref(cpu);
> > +
> > +                pressure = max_capacity -
> > +                           mult_frac(max_capacity, capped_freq, max_freq);
> > +        }
> > +
> > +        for_each_cpu(cpu, policy->related_cpus)
> > +                WRITE_ONCE(per_cpu(cpufreq_pressure, cpu), pressure);
> > +}
> > +
>
> [...]
>
On Mon, 8 Jan 2024 at 17:35, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 08/01/2024 14:48, Vincent Guittot wrote:
> > Provide to the scheduler a feedback about the temporary max available
> > capacity. Unlike arch_update_thermal_pressure, this doesn't need to be
> > filtered as the pressure will happen for dozens ms or more.
>
> Is this then related to the 'medium pace system pressure' you mentioned
> in your OSPM '23 talk?

Sorry, I forgot to answer this question. Yes, this is the medium pace
system pressure that I mentioned at OSPM '23.

>
> >
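(For reference: a minimal sketch of the reordering discussed above, with
capped_freq and max_freq set up before the boost check so neither is read
uninitialized. This is only an illustration derived from the quoted v2 patch
and the review comments, not the actual respin.)

/*
 * Sketch only, based on the v2 patch and the review above; not the
 * actual follow-up version. capped_freq and max_freq are initialized
 * up front so the boost check no longer reads uninitialized values.
 */
static void cpufreq_update_pressure(struct cpufreq_policy *policy)
{
        unsigned long max_capacity, capped_freq, pressure;
        u32 max_freq;
        int cpu;

        cpu = cpumask_first(policy->related_cpus);
        max_capacity = arch_scale_cpu_capacity(cpu);
        capped_freq = policy->max;
        max_freq = arch_scale_freq_ref(cpu);

        /*
         * A capped frequency at or above the reference max frequency
         * (e.g. a boost frequency) means no pressure; otherwise scale
         * the capacity by capped_freq / max_freq.
         */
        if (max_freq <= capped_freq)
                pressure = 0;
        else
                pressure = max_capacity -
                           mult_frac(max_capacity, capped_freq, max_freq);

        for_each_cpu(cpu, policy->related_cpus)
                WRITE_ONCE(per_cpu(cpufreq_pressure, cpu), pressure);
}

With this ordering, the boost case (policy->max at or above
arch_scale_freq_ref()) still short-circuits to zero pressure, while the
common max_freq == capped_freq case only skips the mult_frac() scaling.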
On 08/01/2024 14:48, Vincent Guittot wrote:
> Following the consolidation and cleanup of CPU capacity in [1], this serie
> reworks how the scheduler gets the pressures on CPUs. We need to take into
> account all pressures applied by cpufreq on the compute capacity of a CPU
> for dozens of ms or more and not only cpufreq cooling device or HW
> mitigiations. we split the pressure applied on CPU's capacity in 2 parts:
> - one from cpufreq and freq_qos
> - one from HW high freq mitigiation.
>
> The next step will be to add a dedicated interface for long standing
> capping of the CPU capacity (i.e. for seconds or more) like the
> scaling_max_freq of cpufreq sysfs. The latter is already taken into
> account by this serie but as a temporary pressure which is not always the
> best choice when we know that it will happen for seconds or more.

I guess this is related to the 'user space system pressure' (*) slide of
your OSPM '23 talk.

Where do you draw the line when it comes to time between (*) and the
'medium pace system pressure' (e.g. thermal and FREQ_QOS)?

IIRC, with (*) you want to rebuild the sched domains etc.

>
> [1] https://lore.kernel.org/lkml/20231211104855.558096-1-vincent.guittot@linaro.org/
>
> Change since v1:
>   - Rework cpufreq_update_pressure()
>
> Change since v1:
>   - Use struct cpufreq_policy as parameter of cpufreq_update_pressure()
>   - Fix typos and comments
>   - Make sched_thermal_decay_shift boot param as deprecated
>
> Vincent Guittot (5):
>   cpufreq: Add a cpufreq pressure feedback for the scheduler
>   sched: Take cpufreq feedback into account
>   thermal/cpufreq: Remove arch_update_thermal_pressure()
>   sched: Rename arch_update_thermal_pressure into
>     arch_update_hw_pressure
>   sched/pelt: Remove shift of thermal clock
>
>  .../admin-guide/kernel-parameters.txt         |  1 +
>  arch/arm/include/asm/topology.h               |  6 +-
>  arch/arm64/include/asm/topology.h             |  6 +-
>  drivers/base/arch_topology.c                  | 26 ++++----
>  drivers/cpufreq/cpufreq.c                     | 36 +++++++++++
>  drivers/cpufreq/qcom-cpufreq-hw.c             |  4 +-
>  drivers/thermal/cpufreq_cooling.c             |  3 -
>  include/linux/arch_topology.h                 |  8 +--
>  include/linux/cpufreq.h                       | 10 +++
>  include/linux/sched/topology.h                |  8 +--
>  .../{thermal_pressure.h => hw_pressure.h}     | 14 ++---
>  include/trace/events/sched.h                  |  2 +-
>  init/Kconfig                                  | 12 ++--
>  kernel/sched/core.c                           |  8 +--
>  kernel/sched/fair.c                           | 63 +++++++++----------
>  kernel/sched/pelt.c                           | 18 +++---
>  kernel/sched/pelt.h                           | 16 ++---
>  kernel/sched/sched.h                          | 22 +------
>  18 files changed, 144 insertions(+), 119 deletions(-)
>  rename include/trace/events/{thermal_pressure.h => hw_pressure.h} (55%)
On Tue, 9 Jan 2024 at 12:34, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 08/01/2024 14:48, Vincent Guittot wrote:
> > Following the consolidation and cleanup of CPU capacity in [1], this serie
> > reworks how the scheduler gets the pressures on CPUs. We need to take into
> > account all pressures applied by cpufreq on the compute capacity of a CPU
> > for dozens of ms or more and not only cpufreq cooling device or HW
> > mitigiations. we split the pressure applied on CPU's capacity in 2 parts:
> > - one from cpufreq and freq_qos
> > - one from HW high freq mitigiation.
> >
> > The next step will be to add a dedicated interface for long standing
> > capping of the CPU capacity (i.e. for seconds or more) like the
> > scaling_max_freq of cpufreq sysfs. The latter is already taken into
> > account by this serie but as a temporary pressure which is not always the
> > best choice when we know that it will happen for seconds or more.
>
> I guess this is related to the 'user space system pressure' (*) slide of
> your OSPM '23 talk.

yes

>
> Where do you draw the line when it comes to time between (*) and the
> 'medium pace system pressure' (e.g. thermal and FREQ_QOS).

My goal is to consider the /sys/../scaling_max_freq as the 'user space
system pressure'

>
> IIRC, with (*) you want to rebuild the sched domains etc.

The easiest way would be to rebuild the sched_domain but the cost is
not small so I would prefer to skip the rebuild and add a new signal
that keeps track of this capped capacity

>
> >
> > [1] https://lore.kernel.org/lkml/20231211104855.558096-1-vincent.guittot@linaro.org/
> >
> > Change since v1:
> >   - Rework cpufreq_update_pressure()
> >
> > Change since v1:
> >   - Use struct cpufreq_policy as parameter of cpufreq_update_pressure()
> >   - Fix typos and comments
> >   - Make sched_thermal_decay_shift boot param as deprecated
> >
> > Vincent Guittot (5):
> >   cpufreq: Add a cpufreq pressure feedback for the scheduler
> >   sched: Take cpufreq feedback into account
> >   thermal/cpufreq: Remove arch_update_thermal_pressure()
> >   sched: Rename arch_update_thermal_pressure into
> >     arch_update_hw_pressure
> >   sched/pelt: Remove shift of thermal clock
> >
> >  .../admin-guide/kernel-parameters.txt         |  1 +
> >  arch/arm/include/asm/topology.h               |  6 +-
> >  arch/arm64/include/asm/topology.h             |  6 +-
> >  drivers/base/arch_topology.c                  | 26 ++++----
> >  drivers/cpufreq/cpufreq.c                     | 36 +++++++++++
> >  drivers/cpufreq/qcom-cpufreq-hw.c             |  4 +-
> >  drivers/thermal/cpufreq_cooling.c             |  3 -
> >  include/linux/arch_topology.h                 |  8 +--
> >  include/linux/cpufreq.h                       | 10 +++
> >  include/linux/sched/topology.h                |  8 +--
> >  .../{thermal_pressure.h => hw_pressure.h}     | 14 ++---
> >  include/trace/events/sched.h                  |  2 +-
> >  init/Kconfig                                  | 12 ++--
> >  kernel/sched/core.c                           |  8 +--
> >  kernel/sched/fair.c                           | 63 +++++++++----------
> >  kernel/sched/pelt.c                           | 18 +++---
> >  kernel/sched/pelt.h                           | 16 ++---
> >  kernel/sched/sched.h                          | 22 +------
> >  18 files changed, 144 insertions(+), 119 deletions(-)
> >  rename include/trace/events/{thermal_pressure.h => hw_pressure.h} (55%)
>
On 09/01/2024 14:29, Vincent Guittot wrote:
> On Tue, 9 Jan 2024 at 12:34, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> On 08/01/2024 14:48, Vincent Guittot wrote:
>>> Following the consolidation and cleanup of CPU capacity in [1], this serie
>>> reworks how the scheduler gets the pressures on CPUs. We need to take into
>>> account all pressures applied by cpufreq on the compute capacity of a CPU
>>> for dozens of ms or more and not only cpufreq cooling device or HW
>>> mitigiations. we split the pressure applied on CPU's capacity in 2 parts:
>>> - one from cpufreq and freq_qos
>>> - one from HW high freq mitigiation.
>>>
>>> The next step will be to add a dedicated interface for long standing
>>> capping of the CPU capacity (i.e. for seconds or more) like the
>>> scaling_max_freq of cpufreq sysfs. The latter is already taken into
>>> account by this serie but as a temporary pressure which is not always the
>>> best choice when we know that it will happen for seconds or more.
>>
>> I guess this is related to the 'user space system pressure' (*) slide of
>> your OSPM '23 talk.
>
> yes
>
>>
>> Where do you draw the line when it comes to time between (*) and the
>> 'medium pace system pressure' (e.g. thermal and FREQ_QOS).
>
> My goal is to consider the /sys/../scaling_max_freq as the 'user space
> system pressure'
>
>>
>> IIRC, with (*) you want to rebuild the sched domains etc.
>
> The easiest way would be to rebuild the sched_domain but the cost is
> not small so I would prefer to skip the rebuild and add a new signal
> that keeps track of this capped capacity

Are you saying that you don't need to rebuild sched domains since the
cpu_capacity information of the sched domain hierarchy is independently
updated via:

update_sd_lb_stats() {

  update_group_capacity() {

    if (!child)
      update_cpu_capacity(sd, cpu) {

        capacity = scale_rt_capacity(cpu) {

          max = get_actual_cpu_capacity(cpu)  <- (*)
        }

        sdg->sgc->capacity = capacity;
        sdg->sgc->min_capacity = capacity;
        sdg->sgc->max_capacity = capacity;
      }
  }
}

(*) influence of temporary and permanent (to be added) frequency
    pressure on cpu_capacity (per-cpu and in sd data)

example: hackbench on h960 with IPA:

                                                                                   cap  min  max
...
hackbench-2284 [007] .Ns..  2170.796726: update_group_capacity: sdg !child cpu=7  1017 1017 1017
hackbench-2456 [007] ..s..  2170.920729: update_group_capacity: sdg !child cpu=7  1018 1018 1018
    <...>-2314 [007] ..s1.  2171.044724: update_group_capacity: sdg !child cpu=7  1011 1011 1011
hackbench-2541 [007] ..s..  2171.168734: update_group_capacity: sdg !child cpu=7   918  918  918
hackbench-2558 [007] .Ns..  2171.228716: update_group_capacity: sdg !child cpu=7   912  912  912
    <...>-2321 [007] ..s..  2171.352718: update_group_capacity: sdg !child cpu=7   812  812  812
hackbench-2553 [007] ..s..  2171.476721: update_group_capacity: sdg !child cpu=7   640  640  640
    <...>-2446 [007] ..s2.  2171.600743: update_group_capacity: sdg !child cpu=7   610  610  610
hackbench-2347 [007] ..s..  2171.724738: update_group_capacity: sdg !child cpu=7   406  406  406
hackbench-2331 [007] .Ns1.  2171.848768: update_group_capacity: sdg !child cpu=7   390  390  390
hackbench-2421 [007] ..s..  2171.972733: update_group_capacity: sdg !child cpu=7   388  388  388
...
On Wed, 10 Jan 2024 at 19:10, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 09/01/2024 14:29, Vincent Guittot wrote:
> > On Tue, 9 Jan 2024 at 12:34, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >>
> >> On 08/01/2024 14:48, Vincent Guittot wrote:
> >>> Following the consolidation and cleanup of CPU capacity in [1], this serie
> >>> reworks how the scheduler gets the pressures on CPUs. We need to take into
> >>> account all pressures applied by cpufreq on the compute capacity of a CPU
> >>> for dozens of ms or more and not only cpufreq cooling device or HW
> >>> mitigiations. we split the pressure applied on CPU's capacity in 2 parts:
> >>> - one from cpufreq and freq_qos
> >>> - one from HW high freq mitigiation.
> >>>
> >>> The next step will be to add a dedicated interface for long standing
> >>> capping of the CPU capacity (i.e. for seconds or more) like the
> >>> scaling_max_freq of cpufreq sysfs. The latter is already taken into
> >>> account by this serie but as a temporary pressure which is not always the
> >>> best choice when we know that it will happen for seconds or more.
> >>
> >> I guess this is related to the 'user space system pressure' (*) slide of
> >> your OSPM '23 talk.
> >
> > yes
> >
> >>
> >> Where do you draw the line when it comes to time between (*) and the
> >> 'medium pace system pressure' (e.g. thermal and FREQ_QOS).
> >
> > My goal is to consider the /sys/../scaling_max_freq as the 'user space
> > system pressure'
> >
> >>
> >> IIRC, with (*) you want to rebuild the sched domains etc.
> >
> > The easiest way would be to rebuild the sched_domain but the cost is
> > not small so I would prefer to skip the rebuild and add a new signal
> > that keeps track of this capped capacity
>
> Are you saying that you don't need to rebuild sched domains since the
> cpu_capacity information of the sched domain hierarchy is independently
> updated via:
>
> update_sd_lb_stats() {
>
>   update_group_capacity() {
>
>     if (!child)
>       update_cpu_capacity(sd, cpu) {
>
>         capacity = scale_rt_capacity(cpu) {
>
>           max = get_actual_cpu_capacity(cpu)  <- (*)
>         }
>
>         sdg->sgc->capacity = capacity;
>         sdg->sgc->min_capacity = capacity;
>         sdg->sgc->max_capacity = capacity;
>       }
>   }
> }
>
> (*) influence of temporary and permanent (to be added) frequency
>     pressure on cpu_capacity (per-cpu and in sd data)

I'm more concerned by rd->max_cpu_capacity, which remains at the original
capacity and triggers spurious load balancing if we take into account the
userspace max freq instead of the original max compute capacity of a CPU.
And also how to manage this in RT and DL.

>
> example: hackbench on h960 with IPA:
>
>                                                                                    cap  min  max
> ...
> hackbench-2284 [007] .Ns..  2170.796726: update_group_capacity: sdg !child cpu=7  1017 1017 1017
> hackbench-2456 [007] ..s..  2170.920729: update_group_capacity: sdg !child cpu=7  1018 1018 1018
>     <...>-2314 [007] ..s1.  2171.044724: update_group_capacity: sdg !child cpu=7  1011 1011 1011
> hackbench-2541 [007] ..s..  2171.168734: update_group_capacity: sdg !child cpu=7   918  918  918
> hackbench-2558 [007] .Ns..  2171.228716: update_group_capacity: sdg !child cpu=7   912  912  912
>     <...>-2321 [007] ..s..  2171.352718: update_group_capacity: sdg !child cpu=7   812  812  812
> hackbench-2553 [007] ..s..  2171.476721: update_group_capacity: sdg !child cpu=7   640  640  640
>     <...>-2446 [007] ..s2.  2171.600743: update_group_capacity: sdg !child cpu=7   610  610  610
> hackbench-2347 [007] ..s..  2171.724738: update_group_capacity: sdg !child cpu=7   406  406  406
> hackbench-2331 [007] .Ns1.  2171.848768: update_group_capacity: sdg !child cpu=7   390  390  390
> hackbench-2421 [007] ..s..  2171.972733: update_group_capacity: sdg !child cpu=7   388  388  388
> ...
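(For context on get_actual_cpu_capacity() referenced in the exchange above:
per the cover letter, the series splits the pressure on a CPU's capacity into
a HW high-frequency mitigation signal and the new cpufreq/freq_qos pressure,
and the "sched: Take cpufreq feedback into account" patch folds both into the
capacity seen by the scheduler. The sketch below is only an illustration of
that idea; the helper names hw_load_avg() and cpufreq_get_pressure(), and the
exact way the two signals are combined, are assumptions here rather than a
quote of the patch.)

/*
 * Illustration only, not the actual patch: capacity left for CFS once
 * both pressure sources are applied. The two signals described in the
 * cover letter are assumed to be exposed as hw_load_avg() (PELT-averaged
 * HW high-freq mitigation) and cpufreq_get_pressure() (raw cpufreq/
 * freq_qos capping); only the larger of the two is subtracted so the
 * same capping is not accounted twice.
 */
static unsigned long get_actual_cpu_capacity(int cpu)
{
        unsigned long capacity = arch_scale_cpu_capacity(cpu);

        capacity -= max(hw_load_avg(cpu_rq(cpu)), cpufreq_get_pressure(cpu));

        return capacity;
}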