Message ID: 20221228165415.3436-1-vincent.guittot@linaro.org
State: New
Series: [v2] sched/fair: unlink misfit task from cpu overutilized
Hi Vincent On 12/28/22 17:54, Vincent Guittot wrote: > By taking into account uclamp_min, the 1:1 relation between task misfit > and cpu overutilized is no longer true as a task with a small util_avg > may not fit a high capacity cpu because of uclamp_min constraint. Wouldn't it be better to split this into two patches * Unlink/Decouple misfit ... * Unlink/Decouple util_fits_cpu from HMP ? > > Add a new state in util_fits_cpu() to reflect the case that task would fit > a CPU except for the uclamp_min hint which is a performance requirement. > > Use -1 to reflect that a CPU doesn't fit only because of uclamp_min so we > can use this new value to take additional action to select the best CPU > that doesn't match uclamp_min hint. This part has nothing to do with the commit subject. I think it's better to split the patches if it's not too much work for you. > > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> > --- > > Change since v1: > - fix some wrong conditions > - take into account more cases > > kernel/sched/fair.c | 99 +++++++++++++++++++++++++++++++++------------ > 1 file changed, 74 insertions(+), 25 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 1649e7d71d24..57077f0a897e 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -4582,8 +4582,7 @@ static inline int util_fits_cpu(unsigned long util, > * 2. The system is being saturated when we're operating near > * max capacity, it doesn't make sense to block overutilized. > */ > - uclamp_max_fits = (capacity_orig == SCHED_CAPACITY_SCALE) && (uclamp_max == SCHED_CAPACITY_SCALE); > - uclamp_max_fits = !uclamp_max_fits && (uclamp_max <= capacity_orig); > + uclamp_max_fits = (uclamp_max <= capacity_orig) || (capacity_orig == SCHED_CAPACITY_SCALE); > fits = fits || uclamp_max_fits; > > /* > @@ -4618,8 +4617,8 @@ static inline int util_fits_cpu(unsigned long util, > * handle the case uclamp_min > uclamp_max. > */ > uclamp_min = min(uclamp_min, uclamp_max); > - if (util < uclamp_min && capacity_orig != SCHED_CAPACITY_SCALE) > - fits = fits && (uclamp_min <= capacity_orig_thermal); > + if (fits && (util < uclamp_min) && (uclamp_min > capacity_orig_thermal)) > + return -1; > > return fits; > } > @@ -4629,7 +4628,7 @@ static inline int task_fits_cpu(struct task_struct *p, int cpu) > unsigned long uclamp_min = uclamp_eff_value(p, UCLAMP_MIN); > unsigned long uclamp_max = uclamp_eff_value(p, UCLAMP_MAX); > unsigned long util = task_util_est(p); > - return util_fits_cpu(util, uclamp_min, uclamp_max, cpu); > + return (util_fits_cpu(util, uclamp_min, uclamp_max, cpu) > 0); So the big difference between your approach and my approach is that task_fits_cpu() and asym_fits_cpu() now are very strict in regards to thermal pressure since with your approach we delegate the smartness to the caller. Should we add a comment for these 2 users to make it obvious we intentionally ignore the '-1' value and why it is okay? I'm not sure I can write a reasonable rationale myself. I'm actually worried this might subtly break decisions made by select_idle_capacity() or feec() when doing the LB. Have you considered this?
> } > > static inline void update_misfit_status(struct task_struct *p, struct rq *rq) > @@ -6864,6 +6863,7 @@ static int > select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target) > { > unsigned long task_util, util_min, util_max, best_cap = 0; > + int fits, best_fits = 0; > int cpu, best_cpu = -1; > struct cpumask *cpus; > > @@ -6879,12 +6879,28 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target) > > if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu)) > continue; > - if (util_fits_cpu(task_util, util_min, util_max, cpu)) > + > + fits = util_fits_cpu(task_util, util_min, util_max, cpu); > + > + /* This CPU fits with all capacity and performance requirements */ > + if (fits > 0) > return cpu; > + /* > + * Only the min performance (i.e. uclamp_min) doesn't fit. Look > + * for the CPU with highest performance capacity. > + */ > + else if (fits < 0) > + cpu_cap = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu)); > > - if (cpu_cap > best_cap) { > + /* > + * First, select cpu which fits better (-1 being better than 0). > + * Then, select the one with largest capacity at same level. > + */ > + if ((fits < best_fits) || > + ((fits == best_fits) && (cpu_cap > best_cap))) { > best_cap = cpu_cap; > best_cpu = cpu; > + best_fits = fits; > } > } > > @@ -6897,7 +6913,7 @@ static inline bool asym_fits_cpu(unsigned long util, > int cpu) > { > if (sched_asym_cpucap_active()) > - return util_fits_cpu(util, util_min, util_max, cpu); > + return (util_fits_cpu(util, util_min, util_max, cpu) > 0); > > return true; > } > @@ -7264,6 +7280,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu) > unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024; > struct root_domain *rd = this_rq()->rd; > int cpu, best_energy_cpu, target = -1; > + int prev_fits = -1, best_fits = -1; > + unsigned long best_thermal_cap = 0; > + unsigned long prev_thermal_cap = 0; > struct sched_domain *sd; > struct perf_domain *pd; > struct energy_env eenv; > @@ -7299,6 +7318,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu) > unsigned long prev_spare_cap = 0; > int max_spare_cap_cpu = -1; > unsigned long base_energy; > + int fits, max_fits = -1; > > cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask); > > @@ -7351,7 +7371,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu) > util_max = max(rq_util_max, p_util_max); > } > } > - if (!util_fits_cpu(util, util_min, util_max, cpu)) > + > + fits = util_fits_cpu(util, util_min, util_max, cpu); > + if (!fits) > continue; > > lsub_positive(&cpu_cap, util); > @@ -7359,7 +7381,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu) > if (cpu == prev_cpu) { > /* Always use prev_cpu as a candidate. 
*/ > prev_spare_cap = cpu_cap; > - } else if (cpu_cap > max_spare_cap) { > + prev_fits = fits; > + } else if ((fits > max_fits) || > + ((fits == max_fits) && (cpu_cap > max_spare_cap))) { > /* > * Find the CPU with the maximum spare capacity > * among the remaining CPUs in the performance > @@ -7367,6 +7391,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu) > */ > max_spare_cap = cpu_cap; > max_spare_cap_cpu = cpu; > + max_fits = fits; > } > } > > @@ -7385,26 +7410,50 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu) > if (prev_delta < base_energy) > goto unlock; > prev_delta -= base_energy; > + prev_thermal_cap = cpu_thermal_cap; > best_delta = min(best_delta, prev_delta); > } > > /* Evaluate the energy impact of using max_spare_cap_cpu. */ > if (max_spare_cap_cpu >= 0 && max_spare_cap > prev_spare_cap) { > + /* Current best energy cpu fits better */ > + if (max_fits < best_fits) > + continue; > + > + /* > + * Both don't fit performance (i.e. uclamp_min) but > + * best energy cpu has better performance. > + */ > + if ((max_fits < 0) && > + (cpu_thermal_cap <= best_thermal_cap)) > + continue; > + > cur_delta = compute_energy(&eenv, pd, cpus, p, > max_spare_cap_cpu); > /* CPU utilization has changed */ > if (cur_delta < base_energy) > goto unlock; > cur_delta -= base_energy; > - if (cur_delta < best_delta) { > - best_delta = cur_delta; > - best_energy_cpu = max_spare_cap_cpu; > - } > + > + /* > + * Both fit for the task but best energy cpu has lower > + * energy impact. > + */ > + if ((max_fits > 0) && Shouldn't this be if ((max_fits > 0) && (max_fits == best_fits) && ? We should update best_delta unconditionally first time we hit max_fits = 1, no? I think it's worth extending the comment with something along the lines of * ... except for the first time max_fits becomes 1 * then we must update best_delta unconditionally > + (cur_delta >= best_delta)) > + continue; > + > + best_delta = cur_delta; > + best_energy_cpu = max_spare_cap_cpu; > + best_fits = max_fits; > + best_thermal_cap = cpu_thermal_cap; > } > } > rcu_read_unlock(); > > - if (best_delta < prev_delta) > + if ((best_fits > prev_fits) || > + ((best_fits > 0) && (best_delta < prev_delta)) || > + ((best_fits < 0) && (best_thermal_cap > prev_thermal_cap))) > target = best_energy_cpu; Overall I think the approach is sound. I tested it on my pinebook pro and couldn't catch obvious breakage at least. I am still worried though about spilling the knowledge outside of util_fits_cpu() is creating extra complexity in the callers and potentially more fragility when these callers evolve overtime e.g: task_fits_cpu()/asym_fits_cpu() gain a new user that must actually care about the -1 return value. I think we can still optimize the capacity inversion logic to use no loops without having to spill the knowledge to the users/callers of util_fits_cpu(), no? That said except for the few comments I had this LGTM anyway. Thanks for your effort! 
Cheers -- Qais Yousef > > return target; > @@ -10228,24 +10277,23 @@ static struct sched_group *find_busiest_group(struct lb_env *env) > */ > update_sd_lb_stats(env, &sds); > > - if (sched_energy_enabled()) { > - struct root_domain *rd = env->dst_rq->rd; > - > - if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized)) > - goto out_balanced; > - } > - > - local = &sds.local_stat; > - busiest = &sds.busiest_stat; > - > /* There is no busy sibling group to pull tasks from */ > if (!sds.busiest) > goto out_balanced; > > + busiest = &sds.busiest_stat; > + > /* Misfit tasks should be dealt with regardless of the avg load */ > if (busiest->group_type == group_misfit_task) > goto force_balance; > > + if (sched_energy_enabled()) { > + struct root_domain *rd = env->dst_rq->rd; > + > + if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized)) > + goto out_balanced; > + } > + > /* ASYM feature bypasses nice load balance check */ > if (busiest->group_type == group_asym_packing) > goto force_balance; > @@ -10258,6 +10306,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env) > if (busiest->group_type == group_imbalanced) > goto force_balance; > > + local = &sds.local_stat; > /* > * If the local group is busier than the selected busiest group > * don't try and pull any tasks. > -- > 2.17.1 >
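To make the tri-state convention under review concrete, here is a minimal userspace model of the patched helper. It is a sketch under stated simplifications, not the kernel code: the fits_capacity() margin, rq-level clamp aggregation and capacity-inversion handling are dropped, and only the return-value semantics of the quoted hunks are kept.

#include <stdio.h>

#define SCHED_CAPACITY_SCALE 1024UL

/*
 * Model of the patched util_fits_cpu():
 *    1 -> utilization and both uclamp constraints are satisfied
 *    0 -> utilization does not fit at all
 *   -1 -> everything fits except the uclamp_min performance hint,
 *         because thermal pressure reduced the usable capacity
 */
static int util_fits_cpu_model(unsigned long util,
			       unsigned long uclamp_min,
			       unsigned long uclamp_max,
			       unsigned long capacity_orig,
			       unsigned long thermal_pressure)
{
	unsigned long capacity_orig_thermal = capacity_orig - thermal_pressure;
	int fits = util <= capacity_orig;

	/* uclamp_max may allow capping a big task onto this CPU */
	fits = fits || (uclamp_max <= capacity_orig) ||
	       (capacity_orig == SCHED_CAPACITY_SCALE);

	/* honour uclamp_max when a buggy hint sets uclamp_min above it */
	if (uclamp_min > uclamp_max)
		uclamp_min = uclamp_max;

	/* fits, except for the uclamp_min performance request */
	if (fits && (util < uclamp_min) && (uclamp_min > capacity_orig_thermal))
		return -1;

	return fits;
}

int main(void)
{
	/* light task (util 100) boosted to uclamp_min 900 on a 1024-capacity
	 * CPU under 200 units of thermal pressure: fits except uclamp_min */
	printf("%d\n", util_fits_cpu_model(100, 900, 1024, 1024, 200)); /* -1 */
	/* same CPU without thermal pressure: full fit */
	printf("%d\n", util_fits_cpu_model(100, 900, 1024, 1024, 0));   /* 1 */
	/* heavy task (util 800) on a 512-capacity CPU with default clamps */
	printf("%d\n", util_fits_cpu_model(800, 0, 1024, 512, 0));      /* 0 */
	return 0;
}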
On Thu, 12 Jan 2023 at 15:26, Qais Yousef <qyousef@layalina.io> wrote: > > Hi Vincent > > On 12/28/22 17:54, Vincent Guittot wrote: > > By taking into account uclamp_min, the 1:1 relation between task misfit > > and cpu overutilized is no more true as a task with a small util_avg of > > may not may not fit a high capacity cpu because of uclamp_min constraint. > > Wouldn't it be better to split this into two patches > > * Unlink/Decouple misfit ... > * Unlink/Decouple util_fits_cpu from HMP > > ? I'm afraid that git bisect could then raise a false positive between the 2 commits > > > > > Add a new state in util_fits_cpu() to reflect the case that task would fit > > a CPU except for the uclamp_min hint which is a performance requirement. > > > > Use -1 to reflect that a CPU doesn't fit only because of uclamp_min so we > > can use this new value to take additional action to select the best CPU > > that doesn't match uclamp_min hint. > > This part has nothing to do with the commit subject. I think it's better to > split the patches if it's not too much work for you. > > > > > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> > > --- > > > > Change since v1: > > - fix some wrong conditions > > - take into account more cases > > > > kernel/sched/fair.c | 99 +++++++++++++++++++++++++++++++++------------ > > 1 file changed, 74 insertions(+), 25 deletions(-) > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index 1649e7d71d24..57077f0a897e 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -4582,8 +4582,7 @@ static inline int util_fits_cpu(unsigned long util, > > * 2. The system is being saturated when we're operating near > > * max capacity, it doesn't make sense to block overutilized. > > */ > > - uclamp_max_fits = (capacity_orig == SCHED_CAPACITY_SCALE) && (uclamp_max == SCHED_CAPACITY_SCALE); > > - uclamp_max_fits = !uclamp_max_fits && (uclamp_max <= capacity_orig); > > + uclamp_max_fits = (uclamp_max <= capacity_orig) || (capacity_orig == SCHED_CAPACITY_SCALE); > > fits = fits || uclamp_max_fits; > > > > /* > > @@ -4618,8 +4617,8 @@ static inline int util_fits_cpu(unsigned long util, > > * handle the case uclamp_min > uclamp_max. > > */ > > uclamp_min = min(uclamp_min, uclamp_max); > > - if (util < uclamp_min && capacity_orig != SCHED_CAPACITY_SCALE) > > - fits = fits && (uclamp_min <= capacity_orig_thermal); > > + if (fits && (util < uclamp_min) && (uclamp_min > capacity_orig_thermal)) > > + return -1; > > > > return fits; > > } > > @@ -4629,7 +4628,7 @@ static inline int task_fits_cpu(struct task_struct *p, int cpu) > > unsigned long uclamp_min = uclamp_eff_value(p, UCLAMP_MIN); > > unsigned long uclamp_max = uclamp_eff_value(p, UCLAMP_MAX); > > unsigned long util = task_util_est(p); > > - return util_fits_cpu(util, uclamp_min, uclamp_max, cpu); > > + return (util_fits_cpu(util, uclamp_min, uclamp_max, cpu) > 0); > > So the big difference between your approach and my approach is that > task_fits_cpu() and asym_fits_cpu() now are very strict in regards to thermal > pressure since with your approach we delegate the smartness to the caller. > > Should we add a comment for these 2 users to make it obvious we intentionally > ignore the '-1' value and why it is okay? I can probably add something saying that a positive value (1 in this case) is the only one that says that a task fits to a cpu. Other returned values imply that either the utilization or the uclamp constraints are not meet > > I'm not sure I can write a reasonable rationale myself. 
I'm actually worried > this might subtly break decisions made by select_idle_capacity() or feec() when > doing the LB. > > Have you considered this? Yes, that why i have keep the changes in 1 patch > > > } > > > > static inline void update_misfit_status(struct task_struct *p, struct rq *rq) > > @@ -6864,6 +6863,7 @@ static int > > select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target) > > { > > unsigned long task_util, util_min, util_max, best_cap = 0; > > + int fits, best_fits = 0; > > int cpu, best_cpu = -1; > > struct cpumask *cpus; > > > > @@ -6879,12 +6879,28 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target) > > > > if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu)) > > continue; > > - if (util_fits_cpu(task_util, util_min, util_max, cpu)) > > + > > + fits = util_fits_cpu(task_util, util_min, util_max, cpu); > > + > > + /* This CPU fits with all capacity and performance requirements */ > > + if (fits > 0) > > return cpu; > > + /* > > + * Only the min performance (i.e. uclamp_min) doesn't fit. Look > > + * for the CPU with highest performance capacity. > > + */ > > + else if (fits < 0) > > + cpu_cap = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu)); > > > > - if (cpu_cap > best_cap) { > > + /* > > + * First, select cpu which fits better (-1 being better than 0). > > + * Then, select the one with largest capacity at same level. > > + */ > > + if ((fits < best_fits) || > > + ((fits == best_fits) && (cpu_cap > best_cap))) { > > best_cap = cpu_cap; > > best_cpu = cpu; > > + best_fits = fits; > > } > > } > > > > @@ -6897,7 +6913,7 @@ static inline bool asym_fits_cpu(unsigned long util, > > int cpu) > > { > > if (sched_asym_cpucap_active()) > > - return util_fits_cpu(util, util_min, util_max, cpu); > > + return (util_fits_cpu(util, util_min, util_max, cpu) > 0); > > > > return true; > > } > > @@ -7264,6 +7280,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu) > > unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024; > > struct root_domain *rd = this_rq()->rd; > > int cpu, best_energy_cpu, target = -1; > > + int prev_fits = -1, best_fits = -1; > > + unsigned long best_thermal_cap = 0; > > + unsigned long prev_thermal_cap = 0; > > struct sched_domain *sd; > > struct perf_domain *pd; > > struct energy_env eenv; > > @@ -7299,6 +7318,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu) > > unsigned long prev_spare_cap = 0; > > int max_spare_cap_cpu = -1; > > unsigned long base_energy; > > + int fits, max_fits = -1; > > > > cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask); > > > > @@ -7351,7 +7371,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu) > > util_max = max(rq_util_max, p_util_max); > > } > > } > > - if (!util_fits_cpu(util, util_min, util_max, cpu)) > > + > > + fits = util_fits_cpu(util, util_min, util_max, cpu); > > + if (!fits) > > continue; > > > > lsub_positive(&cpu_cap, util); > > @@ -7359,7 +7381,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu) > > if (cpu == prev_cpu) { > > /* Always use prev_cpu as a candidate. 
*/ > > prev_spare_cap = cpu_cap; > > - } else if (cpu_cap > max_spare_cap) { > > + prev_fits = fits; > > + } else if ((fits > max_fits) || > > + ((fits == max_fits) && (cpu_cap > max_spare_cap))) { > > /* > > * Find the CPU with the maximum spare capacity > > * among the remaining CPUs in the performance > > @@ -7367,6 +7391,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu) > > */ > > max_spare_cap = cpu_cap; > > max_spare_cap_cpu = cpu; > > + max_fits = fits; > > } > > } > > > > @@ -7385,26 +7410,50 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu) > > if (prev_delta < base_energy) > > goto unlock; > > prev_delta -= base_energy; > > + prev_thermal_cap = cpu_thermal_cap; > > best_delta = min(best_delta, prev_delta); > > } > > > > /* Evaluate the energy impact of using max_spare_cap_cpu. */ > > if (max_spare_cap_cpu >= 0 && max_spare_cap > prev_spare_cap) { > > + /* Current best energy cpu fits better */ > > + if (max_fits < best_fits) > > + continue; > > + > > + /* > > + * Both don't fit performance (i.e. uclamp_min) but > > + * best energy cpu has better performance. > > + */ > > + if ((max_fits < 0) && > > + (cpu_thermal_cap <= best_thermal_cap)) > > + continue; > > + > > cur_delta = compute_energy(&eenv, pd, cpus, p, > > max_spare_cap_cpu); > > /* CPU utilization has changed */ > > if (cur_delta < base_energy) > > goto unlock; > > cur_delta -= base_energy; > > - if (cur_delta < best_delta) { > > - best_delta = cur_delta; > > - best_energy_cpu = max_spare_cap_cpu; > > - } > > + > > + /* > > + * Both fit for the task but best energy cpu has lower > > + * energy impact. > > + */ > > + if ((max_fits > 0) && > > Shouldn't this be > > if ((max_fits > 0) && (max_fits == best_fits) && > ? I will use the below which match better the comment and the fact that both fit for the task: + if ((max_fits > 0) && (best_fits > 0) && > > We should update best_delta unconditionally first time we hit max_fits = 1, no? > > I think it's worth extending the comment with something along the lines of > > * ... except for the first time max_fits becomes 1 > * then we must update best_delta unconditionally With the new condition above this is not needed anymore > > > + (cur_delta >= best_delta)) > > + continue; > > + > > + best_delta = cur_delta; > > + best_energy_cpu = max_spare_cap_cpu; > > + best_fits = max_fits; > > + best_thermal_cap = cpu_thermal_cap; > > } > > } > > rcu_read_unlock(); > > > > - if (best_delta < prev_delta) > > + if ((best_fits > prev_fits) || > > + ((best_fits > 0) && (best_delta < prev_delta)) || > > + ((best_fits < 0) && (best_thermal_cap > prev_thermal_cap))) > > target = best_energy_cpu; > > Overall I think the approach is sound. I tested it on my pinebook pro and > couldn't catch obvious breakage at least. > > I am still worried though about spilling the knowledge outside of > util_fits_cpu() is creating extra complexity in the callers and potentially > more fragility when these callers evolve overtime e.g: > task_fits_cpu()/asym_fits_cpu() gain a new user that must actually care about > the -1 return value. ask_fits_cpu()/asym_fits_cpu() remain simple booleans that return true if the task fits the cpu in regards to all requirements. If a new user wants to make smarter decisions like select_idle_capacity() or feec(), it will have to use util_fits_cpu(). Both handle the case where uclamp_min is not met differently. 
> > I think we can still optimize the capacity inversion logic to use no loops > without having to spill the knowledge to the users/callers of util_fits_cpu(), > no? TBH, I don't know how because we are not always comparing the same things depending of the reason it doesn't fit > > That said except for the few comments I had this LGTM anyway. Thanks for your > effort! Thanks I'm going to prepare v2 > > > Cheers > > -- > Qais Yousef > > > > > return target; > > @@ -10228,24 +10277,23 @@ static struct sched_group *find_busiest_group(struct lb_env *env) > > */ > > update_sd_lb_stats(env, &sds); > > > > - if (sched_energy_enabled()) { > > - struct root_domain *rd = env->dst_rq->rd; > > - > > - if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized)) > > - goto out_balanced; > > - } > > - > > - local = &sds.local_stat; > > - busiest = &sds.busiest_stat; > > - > > /* There is no busy sibling group to pull tasks from */ > > if (!sds.busiest) > > goto out_balanced; > > > > + busiest = &sds.busiest_stat; > > + > > /* Misfit tasks should be dealt with regardless of the avg load */ > > if (busiest->group_type == group_misfit_task) > > goto force_balance; > > > > + if (sched_energy_enabled()) { > > + struct root_domain *rd = env->dst_rq->rd; > > + > > + if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized)) > > + goto out_balanced; > > + } > > + > > /* ASYM feature bypasses nice load balance check */ > > if (busiest->group_type == group_asym_packing) > > goto force_balance; > > @@ -10258,6 +10306,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env) > > if (busiest->group_type == group_imbalanced) > > goto force_balance; > > > > + local = &sds.local_stat; > > /* > > * If the local group is busier than the selected busiest group > > * don't try and pull any tasks. > > -- > > 2.17.1 > >
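One detail worth spelling out for readers of the two call sites quoted above: the comparison operators point in opposite directions, yet both loops implement the same preference order. The sketch below is illustrative only, not kernel code.

/*
 * Preference order implied by the patch:
 *   full fit (1) > misses only uclamp_min (-1) > no fit (0)
 *
 * select_idle_capacity() can test 'fits < best_fits' because a full fit
 * returns immediately, so only 0 and -1 ever reach its comparison;
 * feec() tests 'fits > max_fits' because 0 is skipped with 'continue',
 * so only 1 and -1 remain. A rank function makes the shared order
 * explicit:
 */
static int fit_rank(int fits)
{
	switch (fits) {
	case 1:		return 2;	/* all constraints met */
	case -1:	return 1;	/* only uclamp_min missed */
	default:	return 0;	/* utilization does not fit */
	}
}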
Hi, > By taking into account uclamp_min, the 1:1 relation between task misfit > and cpu overutilized is no more true as a task with a small util_avg of > may not may not fit a high capacity cpu because of uclamp_min constraint. > > Add a new state in util_fits_cpu() to reflect the case that task would fit > a CPU except for the uclamp_min hint which is a performance requirement. > > Use -1 to reflect that a CPU doesn't fit only because of uclamp_min so we > can use this new value to take additional action to select the best CPU > that doesn't match uclamp_min hint. I just wanted to flag some issues I noticed with this patch and the entire topic. I was testing this on a Pixel 6 with a 5.18 android-mainline kernel with all the relevant uclamp and CFS scheduling patches backported to it from mainline. From what I can see, the 'uclamp fits capacity' patchset introduced some alarming power usage & performance issues that this patch makes even worse. The patch stack for the following tables is as follows: (ufc_patched) sched/fair: unlink misfit task from cpu overutilized sched/uclamp: Fix a uninitialized variable warnings (baseline_ufc) sched/fair: Check if prev_cpu has highest spare cap in feec() sched/uclamp: Cater for uclamp in find_energy_efficient_cpu()'s early exit condition sched/uclamp: Make cpu_overutilized() use util_fits_cpu() sched/uclamp: Make asym_fits_capacity() use util_fits_cpu() sched/uclamp: Make select_idle_capacity() use util_fits_cpu() sched/uclamp: Fix fits_capacity() check in feec() sched/uclamp: Make task_fits_capacity() use util_fits_cpu() sched/uclamp: Fix relationship between uclamp and migration margin (previous 'baseline' was here) I omitted the 3 patches relating directly to capacity_inversion but in the other tests I did with those there were similar issues. It's probably easier to consider the uclamp parts and their effects in isolation. 1. Geekbench 5 (performance regression) +-----------------+----------------------------+--------+-----------+ | metric | kernel | value | perc_diff | +-----------------+----------------------------+--------+-----------+ | multicore_score | baseline | 2765.4 | 0.0% | | multicore_score | baseline_ufc | 2704.3 | -2.21% | <-- a noticeable score decrease already | multicore_score | ufc_patched | 2443.2 | -11.65% | <-- a massive score decrease +-----------------+----------------------------+--------+-----------+ +--------------+--------+----------------------------+--------+-----------+ | chan_name | metric | kernel | value | perc_diff | +--------------+--------+----------------------------+--------+-----------+ | total_power | gmean | baseline | 2664.0 | 0.0% | | total_power | gmean | baseline_ufc | 2621.5 | -1.6% | <-- worse performance per watt | total_power | gmean | ufc_patched | 2601.2 | -2.36% | <-- much worse performance per watt +--------------+--------+----------------------------+--------+-----------+ The most likely cause for the regression seen above is the decrease in the amount of time spent while overutilized with these patches. Maximising overutilization for GB5 is the desired outcome as the benchmark for almost its entire duration keeps either 1 core or all the cores completely saturated so EAS cannot be effective. These patches have the opposite from the desired effect in this area. 
+----------------------------+--------------------+--------------------+------------+ | kernel | time | total_time | percentage | +----------------------------+--------------------+--------------------+------------+ | baseline | 121.979 | 181.065 | 67.46 | | baseline_ufc | 120.355 | 184.255 | 65.32 | | ufc_patched | 60.715 | 196.135 | 30.98 | <-- !!! +----------------------------+--------------------+--------------------+------------+ 2. Jankbench (power usage regression) +--------+---------------+---------------------------------+-------+-----------+ | metric | variable | kernel | value | perc_diff | +--------+---------------+---------------------------------+-------+-----------+ | gmean | mean_duration | baseline_60hz | 14.6 | 0.0% | | gmean | mean_duration | baseline_ufc_60hz | 15.2 | 3.83% | | gmean | mean_duration | ufc_patched_60hz | 14.0 | -4.12% | +--------+---------------+---------------------------------+-------+-----------+ +--------+-----------+---------------------------------+-------+-----------+ | metric | variable | kernel | value | perc_diff | +--------+-----------+---------------------------------+-------+-----------+ | gmean | jank_perc | baseline_60hz | 1.9 | 0.0% | | gmean | jank_perc | baseline_ufc_60hz | 2.2 | 15.39% | | gmean | jank_perc | ufc_patched_60hz | 2.0 | 3.61% | +--------+-----------+---------------------------------+-------+-----------+ +--------------+--------+---------------------------------+-------+-----------+ | chan_name | metric | kernel | value | perc_diff | +--------------+--------+---------------------------------+-------+-----------+ | total_power | gmean | baseline_60hz | 135.9 | 0.0% | | total_power | gmean | baseline_ufc_60hz | 155.7 | 14.61% | <-- !!! | total_power | gmean | ufc_patched_60hz | 157.1 | 15.63% | <-- !!! +--------------+--------+---------------------------------+-------+-----------+ With these patches while running Jankbench we use up ~15% more power just to achieve roughly the same results. Here I'm not sure where this issue is coming from exactly but all the results above are very consistent across different runs. > -- > 2.17.1 > >
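For context on why the time spent overutilized tracks the GB5 score: EAS only places tasks while the root domain is not overutilized, and the find_busiest_group() hunk quoted earlier makes the load balancer back off in exactly that case. A condensed model of the gating follows; the names mirror the kernel's, the logic is simplified.

/*
 * While rd->overutilized is set, feec() is bypassed at wake-up and the
 * load balancer spreads load as on a regular system; while it is clear,
 * EAS picks CPUs and periodic load balancing bails out early. A
 * benchmark such as GB5 that saturates one or all CPUs is therefore
 * expected to spend most of its runtime overutilized.
 */
static int wakeup_uses_eas(int sched_energy_enabled, int rd_overutilized)
{
	return sched_energy_enabled && !rd_overutilized;
}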
Hi Kajetan, On Fri, 13 Jan 2023 at 14:50, Kajetan Puchalski <kajetan.puchalski@arm.com> wrote: > > Hi, > > > By taking into account uclamp_min, the 1:1 relation between task misfit > > and cpu overutilized is no more true as a task with a small util_avg of > > may not may not fit a high capacity cpu because of uclamp_min constraint. > > > > Add a new state in util_fits_cpu() to reflect the case that task would fit > > a CPU except for the uclamp_min hint which is a performance requirement. > > > > Use -1 to reflect that a CPU doesn't fit only because of uclamp_min so we > > can use this new value to take additional action to select the best CPU > > that doesn't match uclamp_min hint. > > I just wanted to flag some issues I noticed with this patch and the > entire topic. > > I was testing this on a Pixel 6 with a 5.18 android-mainline kernel with Do you have more details to share on your setup ? Android kernel has some hack on top of the mainline. Do you use some ? Then, the perf and the power can be largely impacted by the cgroup configuration. Have you got details on your setup ? I'm going to try to reproduce the behavior > all the relevant uclamp and CFS scheduling patches backported to it from > mainline. From what I can see, the 'uclamp fits capacity' patchset > introduced some alarming power usage & performance issues that this > patch makes even worse. > > The patch stack for the following tables is as follows: > > (ufc_patched) sched/fair: unlink misfit task from cpu overutilized I just sent a v3 which fixes a condition. Wonder if this could have an impact on the results both perf and power > sched/uclamp: Fix a uninitialized variable warnings > (baseline_ufc) sched/fair: Check if prev_cpu has highest spare cap in feec() > sched/uclamp: Cater for uclamp in find_energy_efficient_cpu()'s early exit condition > sched/uclamp: Make cpu_overutilized() use util_fits_cpu() > sched/uclamp: Make asym_fits_capacity() use util_fits_cpu() > sched/uclamp: Make select_idle_capacity() use util_fits_cpu() > sched/uclamp: Fix fits_capacity() check in feec() > sched/uclamp: Make task_fits_capacity() use util_fits_cpu() > sched/uclamp: Fix relationship between uclamp and migration margin > (previous 'baseline' was here) > > I omitted the 3 patches relating directly to capacity_inversion but in > the other tests I did with those there were similar issues. It's > probably easier to consider the uclamp parts and their effects in > isolation. > > 1. 
Geekbench 5 (performance regression) > > +-----------------+----------------------------+--------+-----------+ > | metric | kernel | value | perc_diff | > +-----------------+----------------------------+--------+-----------+ > | multicore_score | baseline | 2765.4 | 0.0% | > | multicore_score | baseline_ufc | 2704.3 | -2.21% | <-- a noticeable score decrease already > | multicore_score | ufc_patched | 2443.2 | -11.65% | <-- a massive score decrease > +-----------------+----------------------------+--------+-----------+ > > +--------------+--------+----------------------------+--------+-----------+ > | chan_name | metric | kernel | value | perc_diff | > +--------------+--------+----------------------------+--------+-----------+ > | total_power | gmean | baseline | 2664.0 | 0.0% | > | total_power | gmean | baseline_ufc | 2621.5 | -1.6% | <-- worse performance per watt > | total_power | gmean | ufc_patched | 2601.2 | -2.36% | <-- much worse performance per watt > +--------------+--------+----------------------------+--------+-----------+ > > The most likely cause for the regression seen above is the decrease in the amount of > time spent while overutilized with these patches. Maximising > overutilization for GB5 is the desired outcome as the benchmark for > almost its entire duration keeps either 1 core or all the cores > completely saturated so EAS cannot be effective. These patches have the > opposite from the desired effect in this area. > > +----------------------------+--------------------+--------------------+------------+ > | kernel | time | total_time | percentage | > +----------------------------+--------------------+--------------------+------------+ > | baseline | 121.979 | 181.065 | 67.46 | > | baseline_ufc | 120.355 | 184.255 | 65.32 | > | ufc_patched | 60.715 | 196.135 | 30.98 | <-- !!! > +----------------------------+--------------------+--------------------+------------+ I'm not surprised because some use cases which were not overutilized were wrongly triggered as overutilized so switching back to performance mode. You might have to tune the uclamp value > > 2. Jankbench (power usage regression) > > +--------+---------------+---------------------------------+-------+-----------+ > | metric | variable | kernel | value | perc_diff | > +--------+---------------+---------------------------------+-------+-----------+ > | gmean | mean_duration | baseline_60hz | 14.6 | 0.0% | > | gmean | mean_duration | baseline_ufc_60hz | 15.2 | 3.83% | > | gmean | mean_duration | ufc_patched_60hz | 14.0 | -4.12% | > +--------+---------------+---------------------------------+-------+-----------+ > > +--------+-----------+---------------------------------+-------+-----------+ > | metric | variable | kernel | value | perc_diff | > +--------+-----------+---------------------------------+-------+-----------+ > | gmean | jank_perc | baseline_60hz | 1.9 | 0.0% | > | gmean | jank_perc | baseline_ufc_60hz | 2.2 | 15.39% | > | gmean | jank_perc | ufc_patched_60hz | 2.0 | 3.61% | > +--------+-----------+---------------------------------+-------+-----------+ > > +--------------+--------+---------------------------------+-------+-----------+ > | chan_name | metric | kernel | value | perc_diff | > +--------------+--------+---------------------------------+-------+-----------+ > | total_power | gmean | baseline_60hz | 135.9 | 0.0% | > | total_power | gmean | baseline_ufc_60hz | 155.7 | 14.61% | <-- !!! > | total_power | gmean | ufc_patched_60hz | 157.1 | 15.63% | <-- !!! 
> +--------------+--------+---------------------------------+-------+-----------+ > > With these patches while running Jankbench we use up ~15% more power > just to achieve roughly the same results. Here I'm not sure where this > issue is coming from exactly but all the results above are very consistent > across different runs. > > > -- > > 2.17.1 > > > >
> > I was testing this on a Pixel 6 with a 5.18 android-mainline kernel with > Do you have more details to share on your setup ? > Android kernel has some hack on top of the mainline. Do you use some ? > Then, the perf and the power can be largely impacted by the cgroup > configuration. Have you got details on your setup ? The kernel I use has all the vendor hooks and hacks switched off to keep it as close to mainline as possible. Unfortunately 5.18 was the last mainline that worked on this device due to some driver issues so we just backport mainline scheduling patches as they come out to at least keep the scheduler itself up to date. > I just sent a v3 which fixes a condition. Wonder if this could have an > impact on the results both perf and power I don't think it'll fix the GB5 score side of it as that's clearly related to overutilization while the condition changed in v3 is inside the non-OU section of feec(). I'll still test the v3 on the weekend if I have some free time. The power usage issue was already introduced in the uclamp fits capacity patchset that's been merged so I doubt this change will be enough to account for it but I'll give it a try regardless. > > The most likely cause for the regression seen above is the decrease in the amount of > > time spent while overutilized with these patches. Maximising > > overutilization for GB5 is the desired outcome as the benchmark for > > almost its entire duration keeps either 1 core or all the cores > > completely saturated so EAS cannot be effective. These patches have the > > opposite from the desired effect in this area. > > > > +----------------------------+--------------------+--------------------+------------+ > > | kernel | time | total_time | percentage | > > +----------------------------+--------------------+--------------------+------------+ > > | baseline | 121.979 | 181.065 | 67.46 | > > | baseline_ufc | 120.355 | 184.255 | 65.32 | > > | ufc_patched | 60.715 | 196.135 | 30.98 | <-- !!! > > +----------------------------+--------------------+--------------------+------------+ > > I'm not surprised because some use cases which were not overutilized > were wrongly triggered as overutilized so switching back to > performance mode. You might have to tune the uclamp value But they'd be wrongly triggered with the 'baseline_ufc' variant while not with the 'baseline' variant. The baseline here is pre taking uclamp into account for cpu_overutilized, all cpu_overutilized did on that kernel was compare util against capacity. Meaning that the 'real' overutilised would be in the ~67% ballpark while the patch makes it incorrectly not trigger more than half the time. I'm not sure we can tweak uclamp enough to fix that. > > > > 2. 
Jankbench (power usage regression) > > > > +--------+---------------+---------------------------------+-------+-----------+ > > | metric | variable | kernel | value | perc_diff | > > +--------+---------------+---------------------------------+-------+-----------+ > > | gmean | mean_duration | baseline_60hz | 14.6 | 0.0% | > > | gmean | mean_duration | baseline_ufc_60hz | 15.2 | 3.83% | > > | gmean | mean_duration | ufc_patched_60hz | 14.0 | -4.12% | > > +--------+---------------+---------------------------------+-------+-----------+ > > > > +--------+-----------+---------------------------------+-------+-----------+ > > | metric | variable | kernel | value | perc_diff | > > +--------+-----------+---------------------------------+-------+-----------+ > > | gmean | jank_perc | baseline_60hz | 1.9 | 0.0% | > > | gmean | jank_perc | baseline_ufc_60hz | 2.2 | 15.39% | > > | gmean | jank_perc | ufc_patched_60hz | 2.0 | 3.61% | > > +--------+-----------+---------------------------------+-------+-----------+ > > > > +--------------+--------+---------------------------------+-------+-----------+ > > | chan_name | metric | kernel | value | perc_diff | > > +--------------+--------+---------------------------------+-------+-----------+ > > | total_power | gmean | baseline_60hz | 135.9 | 0.0% | > > | total_power | gmean | baseline_ufc_60hz | 155.7 | 14.61% | <-- !!! > > | total_power | gmean | ufc_patched_60hz | 157.1 | 15.63% | <-- !!! > > +--------------+--------+---------------------------------+-------+-----------+ > > > > With these patches while running Jankbench we use up ~15% more power > > just to achieve roughly the same results. Here I'm not sure where this > > issue is coming from exactly but all the results above are very consistent > > across different runs. > > > > > -- > > > 2.17.1 > > > > > >
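A sketch of the two overutilized definitions being contrasted in this exchange may help; these are models rather than the exact kernel code, and the 1280/1024 factor stands in for the kernel's roughly 20% fits_capacity() margin.

/* 'baseline': a CPU is overutilized once util stops fitting its capacity */
static int cpu_overutilized_baseline(unsigned long util, unsigned long capacity)
{
	return util * 1280 >= capacity * 1024;	/* !fits_capacity(util, capacity) */
}

/*
 * 'ufc' plus this patch, assuming cpu_overutilized() keeps consuming the
 * return value unchanged: only a 0 raises the flag. A -1 (fits except
 * uclamp_min) is truthy, so it no longer marks the CPU overutilized,
 * which is consistent with the "no OU on big cores" issue mentioned
 * later in the thread.
 */
static int cpu_overutilized_tristate(int fits)
{
	return !fits;
}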
On 01/13/23 09:53, Vincent Guittot wrote: > On Thu, 12 Jan 2023 at 15:26, Qais Yousef <qyousef@layalina.io> wrote: > > > > Hi Vincent > > > > On 12/28/22 17:54, Vincent Guittot wrote: > > > By taking into account uclamp_min, the 1:1 relation between task misfit > > > and cpu overutilized is no more true as a task with a small util_avg of > > > may not may not fit a high capacity cpu because of uclamp_min constraint. > > > > Wouldn't it be better to split this into two patches > > > > * Unlink/Decouple misfit ... > > * Unlink/Decouple util_fits_cpu from HMP > > > > ? > > I'm afraid that git bisect could then raise a false positive between > the 2 commits They should be independent, no? Anyway, I don't feel that strongly about it but as a minor nit the commit subject could be better. > > > > > > > > > Add a new state in util_fits_cpu() to reflect the case that task would fit > > > a CPU except for the uclamp_min hint which is a performance requirement. > > > > > > Use -1 to reflect that a CPU doesn't fit only because of uclamp_min so we > > > can use this new value to take additional action to select the best CPU > > > that doesn't match uclamp_min hint. > > > > This part has nothing to do with the commit subject. I think it's better to > > split the patches if it's not too much work for you. > > > > > > > > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> > > > --- > > > > > > Change since v1: > > > - fix some wrong conditions > > > - take into account more cases > > > > > > kernel/sched/fair.c | 99 +++++++++++++++++++++++++++++++++------------ > > > 1 file changed, 74 insertions(+), 25 deletions(-) > > > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > index 1649e7d71d24..57077f0a897e 100644 > > > --- a/kernel/sched/fair.c > > > +++ b/kernel/sched/fair.c > > > @@ -4582,8 +4582,7 @@ static inline int util_fits_cpu(unsigned long util, > > > * 2. The system is being saturated when we're operating near > > > * max capacity, it doesn't make sense to block overutilized. > > > */ > > > - uclamp_max_fits = (capacity_orig == SCHED_CAPACITY_SCALE) && (uclamp_max == SCHED_CAPACITY_SCALE); > > > - uclamp_max_fits = !uclamp_max_fits && (uclamp_max <= capacity_orig); > > > + uclamp_max_fits = (uclamp_max <= capacity_orig) || (capacity_orig == SCHED_CAPACITY_SCALE); > > > fits = fits || uclamp_max_fits; > > > > > > /* > > > @@ -4618,8 +4617,8 @@ static inline int util_fits_cpu(unsigned long util, > > > * handle the case uclamp_min > uclamp_max. > > > */ > > > uclamp_min = min(uclamp_min, uclamp_max); > > > - if (util < uclamp_min && capacity_orig != SCHED_CAPACITY_SCALE) > > > - fits = fits && (uclamp_min <= capacity_orig_thermal); > > > + if (fits && (util < uclamp_min) && (uclamp_min > capacity_orig_thermal)) > > > + return -1; > > > > > > return fits; > > > } > > > @@ -4629,7 +4628,7 @@ static inline int task_fits_cpu(struct task_struct *p, int cpu) > > > unsigned long uclamp_min = uclamp_eff_value(p, UCLAMP_MIN); > > > unsigned long uclamp_max = uclamp_eff_value(p, UCLAMP_MAX); > > > unsigned long util = task_util_est(p); > > > - return util_fits_cpu(util, uclamp_min, uclamp_max, cpu); > > > + return (util_fits_cpu(util, uclamp_min, uclamp_max, cpu) > 0); > > > > So the big difference between your approach and my approach is that > > task_fits_cpu() and asym_fits_cpu() now are very strict in regards to thermal > > pressure since with your approach we delegate the smartness to the caller. 
> > > > Should we add a comment for these 2 users to make it obvious we intentionally > > ignore the '-1' value and why it is okay? > > I can probably add something saying that a positive value (1 in this > case) is the only one that says that a task fits to a cpu. Other > returned values imply that either the utilization or the uclamp > constraints are not meet Sounds good to me. So at least whoever looks at this later and doesn't know the history will at least try to dig more before using it as-is and assume wonders. > > > > I'm not sure I can write a reasonable rationale myself. I'm actually worried > > this might subtly break decisions made by select_idle_capacity() or feec() when > > doing the LB. > > > > Have you considered this? > > Yes, that why i have keep the changes in 1 patch Okay I think I see what you're on about now. [...] > > > + /* > > > + * Both fit for the task but best energy cpu has lower > > > + * energy impact. > > > + */ > > > + if ((max_fits > 0) && > > > > Shouldn't this be > > > > if ((max_fits > 0) && (max_fits == best_fits) && > > ? > > I will use the below which match better the comment and the fact that > both fit for the task: > > + if ((max_fits > 0) && (best_fits > 0) && > Fine by me. > > > > We should update best_delta unconditionally first time we hit max_fits = 1, no? > > > > I think it's worth extending the comment with something along the lines of > > > > * ... except for the first time max_fits becomes 1 > > * then we must update best_delta unconditionally > > With the new condition above this is not needed anymore +1 > > > > > > + (cur_delta >= best_delta)) > > > + continue; > > > + > > > + best_delta = cur_delta; > > > + best_energy_cpu = max_spare_cap_cpu; > > > + best_fits = max_fits; > > > + best_thermal_cap = cpu_thermal_cap; > > > } > > > } > > > rcu_read_unlock(); > > > > > > - if (best_delta < prev_delta) > > > + if ((best_fits > prev_fits) || > > > + ((best_fits > 0) && (best_delta < prev_delta)) || > > > + ((best_fits < 0) && (best_thermal_cap > prev_thermal_cap))) > > > target = best_energy_cpu; > > > > Overall I think the approach is sound. I tested it on my pinebook pro and > > couldn't catch obvious breakage at least. > > > > I am still worried though about spilling the knowledge outside of > > util_fits_cpu() is creating extra complexity in the callers and potentially > > more fragility when these callers evolve overtime e.g: > > task_fits_cpu()/asym_fits_cpu() gain a new user that must actually care about > > the -1 return value. > > ask_fits_cpu()/asym_fits_cpu() remain simple booleans that return true > if the task fits the cpu in regards to all requirements. If a new user > wants to make smarter decisions like select_idle_capacity() or feec(), > it will have to use util_fits_cpu(). Both handle the case where > uclamp_min is not met differently. I think with that comment added we can hope this will promote new users to think more. > > > > > I think we can still optimize the capacity inversion logic to use no loops > > without having to spill the knowledge to the users/callers of util_fits_cpu(), > > no? > > TBH, I don't know how because we are not always comparing the same > things depending of the reason it doesn't fit Your patch will do a more comprehensive search, true. It will try harder to find a best fit. I do have a bit of a bias that some severe thermal conditions are indication of a bigger problem to try harder to honour uclamp_min (warrant the additional complexity). 
But I thought more about it and I think it actually might be worthwhile. Xuewen had a system where the medium cores' capacity was 900 and bigs could easily be inverted, with some thermal pressure on mediums too. So this harder search can help these systems keep such tasks on mediums. E.g. if uclamp_min was 900, big is inverted and medium is under pressure, my approach would have returned false for all CPUs, while yours will keep the task on medium CPUs. > > > > > That said except for the few comments I had this LGTM anyway. Thanks for your > > effort! > > Thanks > > I'm going to prepare v2 Thanks! -- Qais Yousef
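A worked instance of the medium/big scenario described above, using illustrative numbers and a condensed version of the fit check modelled earlier in the thread:

#include <stdio.h>

/* condensed tri-state fit check: returns 1, 0 or -1 */
static int fits_model(unsigned long util, unsigned long uclamp_min,
		      unsigned long cap_orig, unsigned long thermal)
{
	int fits = util <= cap_orig;

	if (fits && util < uclamp_min && uclamp_min > cap_orig - thermal)
		return -1;
	return fits;
}

int main(void)
{
	/* Xuewen's system as described: mediums at capacity 900 with some
	 * thermal pressure (50 here), bigs at 1024 inverted by heavy
	 * pressure (300 here); the task has util 100, uclamp_min 900 */
	printf("medium: %d\n", fits_model(100, 900, 900, 50));   /* -1 */
	printf("big:    %d\n", fits_model(100, 900, 1024, 300)); /* -1 */

	/* both miss only uclamp_min, so the tri-state search compares the
	 * thermally adjusted capacities, 850 vs 724, and keeps the task
	 * on a medium instead of declaring that nothing fits */
	return 0;
}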
On 01/13/23 15:28, Vincent Guittot wrote: > Hi Kajetan, > > On Fri, 13 Jan 2023 at 14:50, Kajetan Puchalski > <kajetan.puchalski@arm.com> wrote: > > > > Hi, > > > > > By taking into account uclamp_min, the 1:1 relation between task misfit > > > and cpu overutilized is no more true as a task with a small util_avg of > > > may not may not fit a high capacity cpu because of uclamp_min constraint. > > > > > > Add a new state in util_fits_cpu() to reflect the case that task would fit > > > a CPU except for the uclamp_min hint which is a performance requirement. > > > > > > Use -1 to reflect that a CPU doesn't fit only because of uclamp_min so we > > > can use this new value to take additional action to select the best CPU > > > that doesn't match uclamp_min hint. > > > > I just wanted to flag some issues I noticed with this patch and the > > entire topic. > > > > I was testing this on a Pixel 6 with a 5.18 android-mainline kernel with > > Do you have more details to share on your setup ? > Android kernel has some hack on top of the mainline. Do you use some ? > Then, the perf and the power can be largely impacted by the cgroup > configuration. Have you got details on your setup ? > > I'm going to try to reproduce the behavior > > > all the relevant uclamp and CFS scheduling patches backported to it from > > mainline. From what I can see, the 'uclamp fits capacity' patchset > > introduced some alarming power usage & performance issues that this > > patch makes even worse. > > > > The patch stack for the following tables is as follows: > > > > (ufc_patched) sched/fair: unlink misfit task from cpu overutilized > > I just sent a v3 which fixes a condition. Wonder if this could have an > impact on the results both perf and power > > > sched/uclamp: Fix a uninitialized variable warnings > > (baseline_ufc) sched/fair: Check if prev_cpu has highest spare cap in feec() > > sched/uclamp: Cater for uclamp in find_energy_efficient_cpu()'s early exit condition > > sched/uclamp: Make cpu_overutilized() use util_fits_cpu() > > sched/uclamp: Make asym_fits_capacity() use util_fits_cpu() > > sched/uclamp: Make select_idle_capacity() use util_fits_cpu() > > sched/uclamp: Fix fits_capacity() check in feec() > > sched/uclamp: Make task_fits_capacity() use util_fits_cpu() > > sched/uclamp: Fix relationship between uclamp and migration margin > > (previous 'baseline' was here) > > > > I omitted the 3 patches relating directly to capacity_inversion but in This could lead to confusion. Was there a specific reason for this omission? Did you hit some problem? > > the other tests I did with those there were similar issues. It's > > probably easier to consider the uclamp parts and their effects in > > isolation. > > > > 1. 
Geekbench 5 (performance regression) > > > > +-----------------+----------------------------+--------+-----------+ > > | metric | kernel | value | perc_diff | > > +-----------------+----------------------------+--------+-----------+ > > | multicore_score | baseline | 2765.4 | 0.0% | > > | multicore_score | baseline_ufc | 2704.3 | -2.21% | <-- a noticeable score decrease already > > | multicore_score | ufc_patched | 2443.2 | -11.65% | <-- a massive score decrease > > +-----------------+----------------------------+--------+-----------+ > > > > +--------------+--------+----------------------------+--------+-----------+ > > | chan_name | metric | kernel | value | perc_diff | > > +--------------+--------+----------------------------+--------+-----------+ > > | total_power | gmean | baseline | 2664.0 | 0.0% | > > | total_power | gmean | baseline_ufc | 2621.5 | -1.6% | <-- worse performance per watt > > | total_power | gmean | ufc_patched | 2601.2 | -2.36% | <-- much worse performance per watt > > +--------------+--------+----------------------------+--------+-----------+ Hmm I think I've seen a better score but at the cost of more energy in my testing in the past. > > > > The most likely cause for the regression seen above is the decrease in the amount of > > time spent while overutilized with these patches. Maximising > > overutilization for GB5 is the desired outcome as the benchmark for > > almost its entire duration keeps either 1 core or all the cores > > completely saturated so EAS cannot be effective. These patches have the > > opposite from the desired effect in this area. > > > > +----------------------------+--------------------+--------------------+------------+ > > | kernel | time | total_time | percentage | > > +----------------------------+--------------------+--------------------+------------+ > > | baseline | 121.979 | 181.065 | 67.46 | > > | baseline_ufc | 120.355 | 184.255 | 65.32 | > > | ufc_patched | 60.715 | 196.135 | 30.98 | <-- !!! > > +----------------------------+--------------------+--------------------+------------+ > > I'm not surprised because some use cases which were not overutilized > were wrongly triggered as overutilized so switching back to > performance mode. You might have to tune the uclamp value I remember there were tasks with uclamp_min=512 or something like that in the system. I wonder if they're making the difference here - they will steal time from bigger cores and increase energy too. The big jump with Vincent patches is strange though. How many iterations do you run? How long do you wait between each iteration? The original behavior of overutilized in regards to util_avg shouldn't have changed. It's worth digging a bit more into it. I looked at my previous results and I was seeing ~57% on android12-5.10 kernel for both with and without the patch. > > > > > 2. 
Jankbench (power usage regression) > > > > +--------+---------------+---------------------------------+-------+-----------+ > > | metric | variable | kernel | value | perc_diff | > > +--------+---------------+---------------------------------+-------+-----------+ > > | gmean | mean_duration | baseline_60hz | 14.6 | 0.0% | > > | gmean | mean_duration | baseline_ufc_60hz | 15.2 | 3.83% | > > | gmean | mean_duration | ufc_patched_60hz | 14.0 | -4.12% | > > +--------+---------------+---------------------------------+-------+-----------+ > > > > +--------+-----------+---------------------------------+-------+-----------+ > > | metric | variable | kernel | value | perc_diff | > > +--------+-----------+---------------------------------+-------+-----------+ > > | gmean | jank_perc | baseline_60hz | 1.9 | 0.0% | > > | gmean | jank_perc | baseline_ufc_60hz | 2.2 | 15.39% | > > | gmean | jank_perc | ufc_patched_60hz | 2.0 | 3.61% | > > +--------+-----------+---------------------------------+-------+-----------+ How many iterations did you run? Do you think they could be within the noise region? > > > > +--------------+--------+---------------------------------+-------+-----------+ > > | chan_name | metric | kernel | value | perc_diff | > > +--------------+--------+---------------------------------+-------+-----------+ > > | total_power | gmean | baseline_60hz | 135.9 | 0.0% | > > | total_power | gmean | baseline_ufc_60hz | 155.7 | 14.61% | <-- !!! > > | total_power | gmean | ufc_patched_60hz | 157.1 | 15.63% | <-- !!! > > +--------------+--------+---------------------------------+-------+-----------+ > > > > With these patches while running Jankbench we use up ~15% more power > > just to achieve roughly the same results. Here I'm not sure where this > > issue is coming from exactly but all the results above are very consistent > > across different runs. Have you tried to look at uclamp_min/max values of the tasks/cpus? Do you know which cluster ended up using more energy? Have you looked at freq residency between the runs? I think the system could dynamically boost some UI tasks. I'd expect their residency to increase on the medium/big cores compared to before the patch, which could explain what you see. Did you check your schedutil/rate_limit_us is not using the default 10ms? It should be 500us. I had another patch to set transition latency of the cpufreq driver to 500us instead of 5ms - but I doubt this actually was making any difference. FWIW, I did compare my series against vanilla Pixel 6 kernel where I used geekbench, speedometer, pcmark all with and without heavy background activities to measure the impact of uclamp_max and nothing stood out then. I sadly lost my original setup now. I doubt I'll be able to recreate it to re-run these tests again anytime soon :/ Could you try removing all thermal handling from util_fits_cpu() so that my series behaves like v1 again and see if this makes a difference? It'd highlight whether subtle issues with thermal pressure handling are the culprit. Most obvious is using instantaneous thermal pressure in ufc(). Thanks! -- Qais Yousef
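The experiment Qais proposes above would amount to something like the following change inside util_fits_cpu(); this is illustrative and untested, written from memory of the merged series, where the uclamp_min check uses the instantaneous arch_scale_thermal_pressure() value rather than thermal_load_avg():

@@ util_fits_cpu() (sketch, not an applicable patch)
-	capacity_orig_thermal = capacity_orig - arch_scale_thermal_pressure(cpu);
+	capacity_orig_thermal = capacity_orig;

If the Jankbench power delta disappears with the thermal term neutralized, the thermal-pressure handling is implicated; if it persists, the regression lies elsewhere in the series.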
On 14/01/2023 22:18, Qais Yousef wrote: > On 01/13/23 15:28, Vincent Guittot wrote: >> Hi Kajetan, >> >> On Fri, 13 Jan 2023 at 14:50, Kajetan Puchalski >> <kajetan.puchalski@arm.com> wrote: [...] >>> sched/uclamp: Fix a uninitialized variable warnings >>> (baseline_ufc) sched/fair: Check if prev_cpu has highest spare cap in feec() >>> sched/uclamp: Cater for uclamp in find_energy_efficient_cpu()'s early exit condition >>> sched/uclamp: Make cpu_overutilized() use util_fits_cpu() >>> sched/uclamp: Make asym_fits_capacity() use util_fits_cpu() >>> sched/uclamp: Make select_idle_capacity() use util_fits_cpu() >>> sched/uclamp: Fix fits_capacity() check in feec() >>> sched/uclamp: Make task_fits_capacity() use util_fits_cpu() >>> sched/uclamp: Fix relationship between uclamp and migration margin >>> (previous 'baseline' was here) >>> >>> I omitted the 3 patches relating directly to capacity_inversion but in > > This could lead to confusion. Was there a specific reason for this omission? > Did you hit some problem? I thought Vincent's 'split MF from CPU OU' should replace 'CapInv'? https://lkml.kernel.org/r/20221209164739.GA24368@vingu-book ... I have reverted patches: Revert "sched/fair: Detect capacity inversion" Revert "sched/fair: Consider capacity inversion in util_fits_cpu()" Revert "sched/uclamp: Make cpu_overutilized() use util_fits_cpu()" ... You also mentioned this in https://lkml.kernel.org/r/<20230115001906.v7uq4ddodrbvye7d@airbuntu [...]
On Mon, 16 Jan 2023 at 10:21, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: > > On 14/01/2023 22:18, Qais Yousef wrote: > > On 01/13/23 15:28, Vincent Guittot wrote: > >> Hi Kajetan, > >> > >> On Fri, 13 Jan 2023 at 14:50, Kajetan Puchalski > >> <kajetan.puchalski@arm.com> wrote: > > [...] > > >>> sched/uclamp: Fix a uninitialized variable warnings > >>> (baseline_ufc) sched/fair: Check if prev_cpu has highest spare cap in feec() > >>> sched/uclamp: Cater for uclamp in find_energy_efficient_cpu()'s early exit condition > >>> sched/uclamp: Make cpu_overutilized() use util_fits_cpu() > >>> sched/uclamp: Make asym_fits_capacity() use util_fits_cpu() > >>> sched/uclamp: Make select_idle_capacity() use util_fits_cpu() > >>> sched/uclamp: Fix fits_capacity() check in feec() > >>> sched/uclamp: Make task_fits_capacity() use util_fits_cpu() > >>> sched/uclamp: Fix relationship between uclamp and migration margin > >>> (previous 'baseline' was here) > >>> > >>> I omitted the 3 patches relating directly to capacity_inversion but in > > > > This could lead to confusion. Was there a specific reason for this omission? > > Did you hit some problem? > > I thought Vincent's 'split MF from CPU OU' should replace 'CapInv'? Current patch v3 applies on top of tip/sched/core which includes 'CapInv'. The end goal being that CapInv becomes useless and can be removed > > https://lkml.kernel.org/r/20221209164739.GA24368@vingu-book > > ... I have reverted patches: > Revert "sched/fair: Detect capacity inversion" > Revert "sched/fair: Consider capacity inversion in util_fits_cpu()" > Revert "sched/uclamp: Make cpu_overutilized() use util_fits_cpu()" ... > > You also mentioned this in > https://lkml.kernel.org/r/<20230115001906.v7uq4ddodrbvye7d@airbuntu > > [...]
On 01/16/23 09:21, Dietmar Eggemann wrote:
> On 14/01/2023 22:18, Qais Yousef wrote:
> > On 01/13/23 15:28, Vincent Guittot wrote:
> >> Hi Kajetan,
> >>
> >> On Fri, 13 Jan 2023 at 14:50, Kajetan Puchalski
> >> <kajetan.puchalski@arm.com> wrote:
>
> [...]
>
> >>> sched/uclamp: Fix a uninitialized variable warnings
> >>> (baseline_ufc) sched/fair: Check if prev_cpu has highest spare cap in feec()
> >>> sched/uclamp: Cater for uclamp in find_energy_efficient_cpu()'s early exit condition
> >>> sched/uclamp: Make cpu_overutilized() use util_fits_cpu()
> >>> sched/uclamp: Make asym_fits_capacity() use util_fits_cpu()
> >>> sched/uclamp: Make select_idle_capacity() use util_fits_cpu()
> >>> sched/uclamp: Fix fits_capacity() check in feec()
> >>> sched/uclamp: Make task_fits_capacity() use util_fits_cpu()
> >>> sched/uclamp: Fix relationship between uclamp and migration margin
> >>> (previous 'baseline' was here)
> >>>
> >>> I omitted the 3 patches relating directly to capacity_inversion but in
> >
> > This could lead to confusion. Was there a specific reason for this omission?
> > Did you hit some problem?
>
> I thought Vincent's 'split MF from CPU OU' should replace 'CapInv'?

Ah, I thought my version was tested with these removed. Got ya now.

Cheers

--
Qais Yousef

> https://lkml.kernel.org/r/20221209164739.GA24368@vingu-book
>
> ... I have reverted patches:
> Revert "sched/fair: Detect capacity inversion"
> Revert "sched/fair: Consider capacity inversion in util_fits_cpu()"
> Revert "sched/uclamp: Make cpu_overutilized() use util_fits_cpu()" ...
>
> You also mentioned this in
> https://lkml.kernel.org/r/<20230115001906.v7uq4ddodrbvye7d@airbuntu
>
> [...]
On Sat, Jan 14, 2023 at 09:18:54PM +0000, Qais Yousef wrote:

[...]

> I remember there were tasks with uclamp_min=512 or something like that in the
> system. I wonder if they're making the difference here - they will steal time
> from bigger cores and increase energy too.
>
> The big jump with Vincent patches is strange though. How many iterations do you
> run? How long do you wait between each iteration?

It's 3 iterations for GB5 and 10 for Jankbench, all back to back after
keeping the phone in a fridge for an hour. This methodology has proven to
give pretty consistent results so far.

> The original behavior of overutilized in regards to util_avg shouldn't have
> changed. It's worth digging a bit more into it.
>
> I looked at my previous results and I was seeing ~57% on android12-5.10 kernel
> for both with and without the patch.

As you said in the v3 thread, this is caused by the "no OU on big cores"
issue. I'll reply in the v4 thread once I have a more complete set of
numbers, but it's safe to say that removing that change fixed the drop in
score.

+-----------------+-------------------------+--------+-----------+
| metric          | kernel                  | value  | perc_diff |
+-----------------+-------------------------+--------+-----------+
| multicore_score | baseline                | 2765.4 | 0.0%      |
| multicore_score | baseline_ufc            | 2704.3 | -2.21%    |
| multicore_score | ufc_patched_v4          | 2839.8 | 2.69%     |
+-----------------+-------------------------+--------+-----------+

> > >
> > > 2. Jankbench (power usage regression)
> > >
> > > +--------+---------------+---------------------------------+-------+-----------+
> > > | metric | variable      | kernel                          | value | perc_diff |
> > > +--------+---------------+---------------------------------+-------+-----------+
> > > | gmean  | mean_duration | baseline_60hz                   | 14.6  | 0.0%      |
> > > | gmean  | mean_duration | baseline_ufc_60hz               | 15.2  | 3.83%     |
> > > | gmean  | mean_duration | ufc_patched_60hz                | 14.0  | -4.12%    |
> > > +--------+---------------+---------------------------------+-------+-----------+
> > >
> > > +--------+-----------+---------------------------------+-------+-----------+
> > > | metric | variable  | kernel                          | value | perc_diff |
> > > +--------+-----------+---------------------------------+-------+-----------+
> > > | gmean  | jank_perc | baseline_60hz                   | 1.9   | 0.0%      |
> > > | gmean  | jank_perc | baseline_ufc_60hz               | 2.2   | 15.39%    |
> > > | gmean  | jank_perc | ufc_patched_60hz                | 2.0   | 3.61%     |
> > > +--------+-----------+---------------------------------+-------+-----------+
>
> > How many iterations did you run? Do you think they could be within the noise
> > region?

As mentioned, 10 iterations on a cooled down phone. The frame durations/jank
above could be vaguely within noise levels, yes. I'd read it as "no major
change". This is mainly included for contrast with the power usage spike.

> > >
> > > +--------------+--------+---------------------------------+-------+-----------+
> > > | chan_name    | metric | kernel                          | value | perc_diff |
> > > +--------------+--------+---------------------------------+-------+-----------+
> > > | total_power  | gmean  | baseline_60hz                   | 135.9 | 0.0%      |
> > > | total_power  | gmean  | baseline_ufc_60hz               | 155.7 | 14.61%    | <-- !!!
> > > | total_power  | gmean  | ufc_patched_60hz                | 157.1 | 15.63%    | <-- !!!
> > > +--------------+--------+---------------------------------+-------+-----------+
> > >
> > > With these patches while running Jankbench we use up ~15% more power
> > > just to achieve roughly the same results. Here I'm not sure where this
> > > issue is coming from exactly but all the results above are very consistent
> > > across different runs.
>
> > Have you tried to look at uclamp_min/max values of the tasks/cpus?
> >
> > Do you know which cluster ended up using more energy? Have you looked at freq
> > residency between the runs?

+--------------+--------+------------------------------+-------+-----------+
| chan_name    | metric | kernel                       | value | perc_diff |
+--------------+--------+------------------------------+-------+-----------+
| little_power | gmean  | baseline_60hz                | 69.8  | 0.0%      |
| little_power | gmean  | baseline_ufc_60hz            | 74.2  | 6.29%     |
| mid_power    | gmean  | baseline_60hz                | 20.1  | 0.0%      |
| mid_power    | gmean  | baseline_ufc_60hz            | 22.6  | 12.73%    |
| big_power    | gmean  | baseline_60hz                | 45.9  | 0.0%      |
| big_power    | gmean  | baseline_ufc_60hz            | 58.7  | 27.68%    |
| total_power  | gmean  | baseline_60hz                | 135.9 | 0.0%      |
| total_power  | gmean  | baseline_ufc_60hz            | 155.7 | 14.61%    |
+--------------+--------+------------------------------+-------+-----------+

All 3 clusters end up using more power; numerically & proportionally the
biggest increase is on the big cores. If we're looking for anything specific
I can dig into freq residency, it's just that 3 clusters * 10 runs * 40
different OPPs is quite a lot when it comes to trying to find a clear trend.

> I think the system could dynamically boost some UI tasks. I'd expect their
> residency to increase on the medium/big cores compared to before the patch,
> which could explain what you see.
>
> Did you check your schedutil/rate_limit_us is not using the default 10ms? It
> should be 500us.

I do most tests with the mainline default of 10ms. I did a set of runs with
500us just to compare what impact it has and I got worse Jankbench results
(mean duration 15.2 -> 16.9 and jank% 2.2 -> 3.7) with roughly the same power
usage (155.7 -> 156.4). It did help with the GB5 score decrease, though it
was still below the original baseline.

+-----------------+-------------------------+--------+-----------+
| metric          | kernel                  | value  | perc_diff |
+-----------------+-------------------------+--------+-----------+
| multicore_score | baseline                | 2765.4 | 0.0%      |
| multicore_score | baseline_ufc            | 2704.3 | -2.21%    |
| multicore_score | baseline_ufc_su500      | 2759.0 | -0.23%    |
+-----------------+-------------------------+--------+-----------+

> I had another patch to set the transition latency of the cpufreq driver to
> 500us instead of 5ms - but I doubt this actually was making any difference.
>
> FWIW, I did compare my series against the vanilla Pixel 6 kernel, where I used
> geekbench, speedometer and pcmark, all with and without heavy background
> activities, to measure the impact of uclamp_max, and nothing stood out then.

Maybe comparing it against device kernels could be causing some issues?
Wouldn't they be pretty far from mainline to begin with and then get even
further with vendor hooks etc? There could be plenty of side effects and
interactions that come out differently.

> I sadly lost my original setup now. I doubt I'll be able to recreate it to
> re-run these tests again anytime soon :/
>
> Could you try removing all thermal handling from util_fits_cpu() so that my
> series behaves like v1 again and see if this makes a difference? It would
> highlight whether subtle issues with thermal pressure handling are the
> culprit. The most obvious one is using instantaneous thermal pressure in
> ufc().

This is where the interesting stuff comes in - it looks like it is the
thermal handling causing problems, yes.
+-----------------+-------------------------+--------+-----------+
| metric          | kernel                  | value  | perc_diff |
+-----------------+-------------------------+--------+-----------+
| multicore_score | baseline                | 2765.4 | 0.0%      |
| multicore_score | baseline_ufc            | 2704.3 | -2.21%    |
| multicore_score | baseline_ufc_no_thermal | 2870.8 | 3.81%     | <-- 170 pt difference
+-----------------+-------------------------+--------+-----------+

+--------------+--------+-------------------------+--------+-----------+
| chan_name    | metric | kernel                  | value  | perc_diff |
+--------------+--------+-------------------------+--------+-----------+
| total_power  | gmean  | baseline                | 2664.0 | 0.0%      |
| total_power  | gmean  | baseline_ufc            | 2621.5 | -1.6%     |
| total_power  | gmean  | baseline_ufc_no_thermal | 2710.0 | 1.73%     | <-- better PPW
+--------------+--------+-------------------------+--------+-----------+

The Jankbench frame numbers are just below the original baseline but very
much within the noise levels, so I'd say "no change" for brevity.

+--------------+--------+------------------------------+-------+-----------+
| chan_name    | metric | kernel                       | value | perc_diff |
+--------------+--------+------------------------------+-------+-----------+
| total_power  | gmean  | baseline_60hz                | 135.9 | 0.0%      |
| total_power  | gmean  | baseline_ufc_60hz            | 155.7 | 14.61%    |
| total_power  | gmean  | baseline_ufc_no_thermal_60hz | 134.5 | -1.01%    | <-- power saving!
+--------------+--------+------------------------------+-------+-----------+

This is the diff I got the previous results from:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e5f7c48950f8..2b846a7e2ed3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4262,7 +4262,7 @@ static inline int util_fits_cpu(unsigned long util,
 				unsigned long uclamp_max,
 				int cpu)
 {
-	unsigned long capacity_orig, capacity_orig_thermal;
+	unsigned long capacity_orig;
 	unsigned long capacity = capacity_of(cpu);
 	bool fits, uclamp_max_fits;
 
@@ -4299,7 +4299,6 @@ static inline int util_fits_cpu(unsigned long util,
 	 * the time.
 	 */
 	capacity_orig = capacity_orig_of(cpu);
-	capacity_orig_thermal = capacity_orig - arch_scale_thermal_pressure(cpu);
 
 	/*
 	 * We want to force a task to fit a cpu as implied by uclamp_max.
@@ -4375,7 +4374,7 @@ static inline int util_fits_cpu(unsigned long util,
 	 */
 	uclamp_min = min(uclamp_min, uclamp_max);
 	if (util < uclamp_min && capacity_orig != SCHED_CAPACITY_SCALE)
-		fits = fits && (uclamp_min <= capacity_orig_thermal);
+		fits = fits && (uclamp_min <= capacity_orig);
 
 	return fits;
 }

All in all, taking the thermal handling out gives us better scores, better
PPW and better power usage on the whole in Jankbench, so there's clearly
some issue with it. Since you wrote in the v3 thread that we can't use
thermal_load_avg(), maybe we should just take this part out completely for
the time being until there's a better solution? I could send a patch with
the above diff.

---
Kajetan
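For comparison, the variant being ruled out here (keeping the uclamp_min
check but feeding it the PELT-averaged signal that v2 already uses in
select_idle_capacity()) would look roughly like this inside util_fits_cpu().
A sketch only, reusing the names from the hunks above, not a posted patch:

	capacity_orig = capacity_orig_of(cpu);
	/*
	 * Sketch: cap by PELT-averaged thermal pressure instead of the
	 * instantaneous arch_scale_thermal_pressure() value.
	 */
	capacity_orig_thermal = capacity_orig - thermal_load_avg(cpu_rq(cpu));

	...

	uclamp_min = min(uclamp_min, uclamp_max);
	if (util < uclamp_min && capacity_orig != SCHED_CAPACITY_SCALE)
		fits = fits && (uclamp_min <= capacity_orig_thermal);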
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1649e7d71d24..57077f0a897e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4582,8 +4582,7 @@ static inline int util_fits_cpu(unsigned long util,
 	 * 2. The system is being saturated when we're operating near
 	 *    max capacity, it doesn't make sense to block overutilized.
 	 */
-	uclamp_max_fits = (capacity_orig == SCHED_CAPACITY_SCALE) && (uclamp_max == SCHED_CAPACITY_SCALE);
-	uclamp_max_fits = !uclamp_max_fits && (uclamp_max <= capacity_orig);
+	uclamp_max_fits = (uclamp_max <= capacity_orig) || (capacity_orig == SCHED_CAPACITY_SCALE);
 	fits = fits || uclamp_max_fits;
 
 	/*
@@ -4618,8 +4617,8 @@ static inline int util_fits_cpu(unsigned long util,
 	 * handle the case uclamp_min > uclamp_max.
 	 */
 	uclamp_min = min(uclamp_min, uclamp_max);
-	if (util < uclamp_min && capacity_orig != SCHED_CAPACITY_SCALE)
-		fits = fits && (uclamp_min <= capacity_orig_thermal);
+	if (fits && (util < uclamp_min) && (uclamp_min > capacity_orig_thermal))
+		return -1;
 
 	return fits;
 }
@@ -4629,7 +4628,7 @@ static inline int task_fits_cpu(struct task_struct *p, int cpu)
 	unsigned long uclamp_min = uclamp_eff_value(p, UCLAMP_MIN);
 	unsigned long uclamp_max = uclamp_eff_value(p, UCLAMP_MAX);
 	unsigned long util = task_util_est(p);
-	return util_fits_cpu(util, uclamp_min, uclamp_max, cpu);
+	return (util_fits_cpu(util, uclamp_min, uclamp_max, cpu) > 0);
 }
 
 static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
@@ -6864,6 +6863,7 @@ static int
 select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 {
 	unsigned long task_util, util_min, util_max, best_cap = 0;
+	int fits, best_fits = 0;
 	int cpu, best_cpu = -1;
 	struct cpumask *cpus;
 
@@ -6879,12 +6879,28 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 
 		if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
 			continue;
-		if (util_fits_cpu(task_util, util_min, util_max, cpu))
+
+		fits = util_fits_cpu(task_util, util_min, util_max, cpu);
+
+		/* This CPU fits with all capacity and performance requirements */
+		if (fits > 0)
 			return cpu;
+		/*
+		 * Only the min performance (i.e. uclamp_min) doesn't fit. Look
+		 * for the CPU with highest performance capacity.
+		 */
+		else if (fits < 0)
+			cpu_cap = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu));
 
-		if (cpu_cap > best_cap) {
+		/*
+		 * First, select cpu which fits better (-1 being better than 0).
+		 * Then, select the one with largest capacity at same level.
+		 */
+		if ((fits < best_fits) ||
+		    ((fits == best_fits) && (cpu_cap > best_cap))) {
 			best_cap = cpu_cap;
 			best_cpu = cpu;
+			best_fits = fits;
 		}
 	}
 
@@ -6897,7 +6913,7 @@ static inline bool asym_fits_cpu(unsigned long util,
 				 int cpu)
 {
 	if (sched_asym_cpucap_active())
-		return util_fits_cpu(util, util_min, util_max, cpu);
+		return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
 
 	return true;
 }
@@ -7264,6 +7280,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 	unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024;
 	struct root_domain *rd = this_rq()->rd;
 	int cpu, best_energy_cpu, target = -1;
+	int prev_fits = -1, best_fits = -1;
+	unsigned long best_thermal_cap = 0;
+	unsigned long prev_thermal_cap = 0;
 	struct sched_domain *sd;
 	struct perf_domain *pd;
 	struct energy_env eenv;
@@ -7299,6 +7318,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 		unsigned long prev_spare_cap = 0;
 		int max_spare_cap_cpu = -1;
 		unsigned long base_energy;
+		int fits, max_fits = -1;
 
 		cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
 
@@ -7351,7 +7371,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 					util_max = max(rq_util_max, p_util_max);
 				}
 			}
-			if (!util_fits_cpu(util, util_min, util_max, cpu))
+
+			fits = util_fits_cpu(util, util_min, util_max, cpu);
+			if (!fits)
 				continue;
 
 			lsub_positive(&cpu_cap, util);
@@ -7359,7 +7381,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 
 			if (cpu == prev_cpu) {
 				/* Always use prev_cpu as a candidate. */
 				prev_spare_cap = cpu_cap;
-			} else if (cpu_cap > max_spare_cap) {
+				prev_fits = fits;
+			} else if ((fits > max_fits) ||
+				   ((fits == max_fits) && (cpu_cap > max_spare_cap))) {
 				/*
 				 * Find the CPU with the maximum spare capacity
 				 * among the remaining CPUs in the performance
 				 * domain.
 				 */
 				max_spare_cap = cpu_cap;
 				max_spare_cap_cpu = cpu;
+				max_fits = fits;
 			}
 		}
 
@@ -7385,26 +7410,50 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 			if (prev_delta < base_energy)
 				goto unlock;
 			prev_delta -= base_energy;
+			prev_thermal_cap = cpu_thermal_cap;
 			best_delta = min(best_delta, prev_delta);
 		}
 
 		/* Evaluate the energy impact of using max_spare_cap_cpu. */
 		if (max_spare_cap_cpu >= 0 && max_spare_cap > prev_spare_cap) {
+			/* Current best energy cpu fits better */
+			if (max_fits < best_fits)
+				continue;
+
+			/*
+			 * Both don't fit performance (i.e. uclamp_min) but
+			 * best energy cpu has better performance.
+			 */
+			if ((max_fits < 0) &&
+			    (cpu_thermal_cap <= best_thermal_cap))
+				continue;
+
 			cur_delta = compute_energy(&eenv, pd, cpus, p,
 						   max_spare_cap_cpu);
 			/* CPU utilization has changed */
 			if (cur_delta < base_energy)
 				goto unlock;
 			cur_delta -= base_energy;
-			if (cur_delta < best_delta) {
-				best_delta = cur_delta;
-				best_energy_cpu = max_spare_cap_cpu;
-			}
+
+			/*
+			 * Both fit for the task but best energy cpu has lower
+			 * energy impact.
+			 */
+			if ((max_fits > 0) &&
+			    (cur_delta >= best_delta))
+				continue;
+
+			best_delta = cur_delta;
+			best_energy_cpu = max_spare_cap_cpu;
+			best_fits = max_fits;
+			best_thermal_cap = cpu_thermal_cap;
 		}
 	}
 	rcu_read_unlock();
 
-	if (best_delta < prev_delta)
+	if ((best_fits > prev_fits) ||
+	    ((best_fits > 0) && (best_delta < prev_delta)) ||
+	    ((best_fits < 0) && (best_thermal_cap > prev_thermal_cap)))
 		target = best_energy_cpu;
 
 	return target;
@@ -10228,24 +10277,23 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	 */
 	update_sd_lb_stats(env, &sds);
 
-	if (sched_energy_enabled()) {
-		struct root_domain *rd = env->dst_rq->rd;
-
-		if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized))
-			goto out_balanced;
-	}
-
-	local = &sds.local_stat;
-	busiest = &sds.busiest_stat;
-
 	/* There is no busy sibling group to pull tasks from */
 	if (!sds.busiest)
 		goto out_balanced;
 
+	busiest = &sds.busiest_stat;
+
 	/* Misfit tasks should be dealt with regardless of the avg load */
 	if (busiest->group_type == group_misfit_task)
 		goto force_balance;
 
+	if (sched_energy_enabled()) {
+		struct root_domain *rd = env->dst_rq->rd;
+
+		if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized))
+			goto out_balanced;
+	}
+
 	/* ASYM feature bypasses nice load balance check */
 	if (busiest->group_type == group_asym_packing)
 		goto force_balance;
@@ -10258,6 +10306,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	if (busiest->group_type == group_imbalanced)
 		goto force_balance;
 
+	local = &sds.local_stat;
 	/*
 	 * If the local group is busier than the selected busiest group
 	 * don't try and pull any tasks.
By taking into account uclamp_min, the 1:1 relation between task misfit
and cpu overutilized no longer holds, as a task with a small util_avg may
not fit a high capacity cpu because of the uclamp_min constraint.

Add a new state in util_fits_cpu() to reflect the case that a task would
fit a CPU except for the uclamp_min hint, which is a performance
requirement.

Use -1 to reflect that a CPU doesn't fit only because of uclamp_min, so we
can use this new value to take additional action to select the best CPU
that doesn't match the uclamp_min hint.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---

Change since v1:
- fix some wrong conditions
- take into account more cases

 kernel/sched/fair.c | 99 +++++++++++++++++++++++++++++++++------------
 1 file changed, 74 insertions(+), 25 deletions(-)
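For reference, the tri-state contract described above, together with the
preference order select_idle_capacity() applies in the diff, can be condensed
into a small sketch. Illustrative only - candidate_is_better() is a made-up
helper, not part of the patch:

/*
 * util_fits_cpu() return values after this patch:
 *    1: the CPU satisfies all capacity and performance (uclamp)
 *       requirements,
 *    0: the CPU doesn't fit,
 *   -1: the CPU would fit, except thermal pressure prevents it from
 *       meeting the uclamp_min performance hint.
 */
static bool candidate_is_better(int fits, int best_fits,
				unsigned long cap, unsigned long best_cap)
{
	/* -1 (only uclamp_min unmet) beats 0 (doesn't fit at all) */
	if (fits < best_fits)
		return true;

	/* at the same fit level, prefer the larger capped capacity */
	return (fits == best_fits) && (cap > best_cap);
}

A fully fitting CPU (fits == 1) never reaches this comparison since
select_idle_capacity() returns it immediately; feec() uses the opposite
polarity (fits > max_fits) but follows the same 1 > -1 > 0 preference.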