Message ID | 1564670424-26023-9-git-send-email-vincent.guittot@linaro.org
State      | New
Series     | sched/fair: rework the CFS load balance
On 01/08/2019 15:40, Vincent Guittot wrote:
> @@ -8261,7 +8261,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>  			 * If we have more than one misfit sg go with the
>  			 * biggest misfit.
>  			 */
> -			if (sgs->group_misfit_task_load < busiest->group_misfit_task_load)
> +			if (sgs->group_misfit_task_util < busiest->group_misfit_task_util)
>  				return false;

I previously said this change would render the maximization useless, but I
had forgotten one thing: with PELT time scaling, task utilization can go
above its CPU's capacity.

So if you have two LITTLE CPUs running a busy loop (misfit task) each, the
one that's been running the longest would have the highest utilization
(assuming they haven't reached 1024 yet). In other words, "utilizations
above the capacity_margin can't be compared" doesn't really stand.

Still, maximizing load would lead us there. Furthermore, if we have to pick
between two rqs with misfit tasks, I still believe we should pick the one
with the highest load, not the highest utilization.

We could keep load and fix the issue of detaching the wrong task with
something like:

-----8<-----
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 53e64a7b2ae0..bfc2b624ee98 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7489,12 +7489,8 @@ static int detach_tasks(struct lb_env *env)
 		case migrate_misfit:
 			load = task_h_load(p);
 
-			/*
-			 * utilization of misfit task might decrease a bit
-			 * since it has been recorded. Be conservative in the
-			 * condition.
-			 */
-			if (load < env->imbalance)
+			/* This is not a misfit task */
+			if (task_fits_capacity(p, capacity_of(env->src_cpu)))
 				goto next;
 
 			env->imbalance = 0;
----->8-----

However what would be *even* better IMO would be:

-----8<-----
@@ -8853,6 +8853,7 @@ voluntary_active_balance(struct lb_env *env)
 			return 1;
 	}
 
+	/* XXX: make sure current is still a misfit task? */
 	if (env->balance_type == migrate_misfit)
 		return 1;
 
@@ -8966,6 +8967,20 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 	env.src_rq = busiest;
 
 	ld_moved = 0;
+
+	/*
+	 * Misfit tasks are only misfit if they are currently running, see
+	 * update_misfit_status().
+	 *
+	 * - If they're not running, we'll get an opportunity at wakeup to
+	 *   migrate them to a bigger CPU.
+	 * - If they're running, we need to active balance them to a bigger CPU.
+	 *
+	 * Do the smart thing and go straight for active balance.
+	 */
+	if (env->balance_type == migrate_misfit)
+		goto active_balance;
+
 	if (busiest->nr_running > 1) {
 		/*
 		 * Attempt to move tasks. If find_busiest_group has found
@@ -9074,7 +9089,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 			goto out_all_pinned;
 		}
 	}
-
+active_balance:
 	if (!ld_moved) {
 		schedstat_inc(sd->lb_failed[idle]);
 		/*
----->8-----
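As an aside for readers following the PELT argument above: a rough, standalone
approximation (not the kernel's PELT code) of how a permanently running task's
utilization keeps climbing towards 1024, i.e. above a LITTLE CPU's capacity, so
two misfit busy loops that started at different times still have distinct,
comparable utilizations. pelt_util() only models the usual ~32 ms half-life
geometric decay and ignores the capacity/frequency scaling of the time deltas
(which would merely slow the convergence); the LITTLE capacity of 446 is an
illustrative value.

/* Build with: gcc -O2 pelt_toy.c -lm (file name is arbitrary) */
#include <math.h>
#include <stdio.h>

/* Utilization of a task that has been running non-stop for runtime_ms. */
static double pelt_util(double runtime_ms)
{
	const double y = pow(0.5, 1.0 / 32.0);	/* decay factor: y^32 = 0.5 */

	return 1024.0 * (1.0 - pow(y, runtime_ms));
}

int main(void)
{
	const double little_capacity = 446.0;	/* illustrative LITTLE CPU */

	printf("busy loop after 100ms: util ~ %4.0f (CPU capacity %.0f)\n",
	       pelt_util(100), little_capacity);
	printf("busy loop after 300ms: util ~ %4.0f (CPU capacity %.0f)\n",
	       pelt_util(300), little_capacity);
	return 0;
}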
On Thu, 1 Aug 2019 at 18:27, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> On 01/08/2019 15:40, Vincent Guittot wrote:
> > @@ -8261,7 +8261,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> >  			 * If we have more than one misfit sg go with the
> >  			 * biggest misfit.
> >  			 */
> > -			if (sgs->group_misfit_task_load < busiest->group_misfit_task_load)
> > +			if (sgs->group_misfit_task_util < busiest->group_misfit_task_util)
> >  				return false;
>
> I previously said this change would render the maximization useless, but I
> had forgotten one thing: with PELT time scaling, task utilization can go
> above its CPU's capacity.
>
> So if you have two LITTLE CPUs running a busy loop (misfit task) each, the
> one that's been running the longest would have the highest utilization
> (assuming they haven't reached 1024 yet). In other words, "utilizations
> above the capacity_margin can't be compared" doesn't really stand.
>
> Still, maximizing load would lead us there. Furthermore, if we have to pick
> between two rqs with misfit tasks, I still believe we should pick the one
> with the highest load, not the highest utilization.
>
> We could keep load and fix the issue of detaching the wrong task with
> something like:
>
> -----8<-----
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 53e64a7b2ae0..bfc2b624ee98 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7489,12 +7489,8 @@ static int detach_tasks(struct lb_env *env)
>  		case migrate_misfit:
>  			load = task_h_load(p);
>
> -			/*
> -			 * utilization of misfit task might decrease a bit
> -			 * since it has been recorded. Be conservative in the
> -			 * condition.
> -			 */
> -			if (load < env->imbalance)
> +			/* This is not a misfit task */
> +			if (task_fits_capacity(p, capacity_of(env->src_cpu)))
>  				goto next;

This could be a solution to make sure we pull only the misfit task while
keeping the use of load.

>
>  			env->imbalance = 0;
> ----->8-----
>
> However what would be *even* better IMO would be:
>
> -----8<-----
> @@ -8853,6 +8853,7 @@ voluntary_active_balance(struct lb_env *env)
>  			return 1;
>  	}
>
> +	/* XXX: make sure current is still a misfit task? */
>  	if (env->balance_type == migrate_misfit)
>  		return 1;
>
> @@ -8966,6 +8967,20 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>  	env.src_rq = busiest;
>
>  	ld_moved = 0;
> +
> +	/*
> +	 * Misfit tasks are only misfit if they are currently running, see
> +	 * update_misfit_status().
> +	 *
> +	 * - If they're not running, we'll get an opportunity at wakeup to
> +	 *   migrate them to a bigger CPU.
> +	 * - If they're running, we need to active balance them to a bigger CPU.
> +	 *
> +	 * Do the smart thing and go straight for active balance.
> +	 */
> +	if (env->balance_type == migrate_misfit)
> +		goto active_balance;
> +

This looks ugly and adds a new bypass, which this patchset tries to remove.
It also doesn't work if the misfit task has been preempted by another one
during the load balance and is waiting on the runqueue.

>  	if (busiest->nr_running > 1) {
>  		/*
>  		 * Attempt to move tasks. If find_busiest_group has found
> @@ -9074,7 +9089,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>  			goto out_all_pinned;
>  		}
>  	}
> -
> +active_balance:
>  	if (!ld_moved) {
>  		schedstat_inc(sd->lb_failed[idle]);
>  		/*
> ----->8-----
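To make the preemption concern concrete, here is a minimal toy sketch (not
kernel code) of the semantics that update_misfit_status() implements in the
patch further down: the rq-level misfit state mirrors whichever task is
running when the status is refreshed, so a misfit task that gets preempted
stops being visible through it even though it is still queued. struct toy_rq,
toy_update_misfit_status() and all numbers are made up for illustration.

#include <stdbool.h>
#include <stdio.h>

struct toy_rq {
	unsigned long misfit_task_util;	/* stands in for rq->misfit_task_util */
};

/* Called with whatever task is currently running, like the scheduler tick. */
static void toy_update_misfit_status(struct toy_rq *rq, unsigned long curr_util,
				     bool curr_fits_cpu)
{
	rq->misfit_task_util = curr_fits_cpu ? 0 : curr_util;
}

int main(void)
{
	struct toy_rq rq = { 0 };

	/* A busy loop too big for its LITTLE CPU is running. */
	toy_update_misfit_status(&rq, 900, false);
	printf("misfit_task_util while the big task runs: %lu\n", rq.misfit_task_util);

	/*
	 * A small task preempts it; the big task is still queued, but the
	 * rq no longer advertises a misfit task.
	 */
	toy_update_misfit_status(&rq, 150, true);
	printf("misfit_task_util after preemption:        %lu\n", rq.misfit_task_util);
	return 0;
}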
On 02/08/2019 09:29, Vincent Guittot wrote:
>> However what would be *even* better IMO would be:
>>
>> -----8<-----
>> @@ -8853,6 +8853,7 @@ voluntary_active_balance(struct lb_env *env)
>>  			return 1;
>>  	}
>>
>> +	/* XXX: make sure current is still a misfit task? */
>>  	if (env->balance_type == migrate_misfit)
>>  		return 1;
>>
>> @@ -8966,6 +8967,20 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>  	env.src_rq = busiest;
>>
>>  	ld_moved = 0;
>> +
>> +	/*
>> +	 * Misfit tasks are only misfit if they are currently running, see
>> +	 * update_misfit_status().
>> +	 *
>> +	 * - If they're not running, we'll get an opportunity at wakeup to
>> +	 *   migrate them to a bigger CPU.
>> +	 * - If they're running, we need to active balance them to a bigger CPU.
>> +	 *
>> +	 * Do the smart thing and go straight for active balance.
>> +	 */
>> +	if (env->balance_type == migrate_misfit)
>> +		goto active_balance;
>> +
>
> This looks ugly and adds a new bypass, which this patchset tries to remove.
> It also doesn't work if the misfit task has been preempted by another one
> during the load balance and is waiting on the runqueue.

I won't comment on aesthetics, but when it comes to preempted misfit tasks,
do consider that a task *has* to have a utilization >= 80% of its CPU's
capacity to be flagged as misfit. If it's a busy loop, it can only be
preempted ~20% of the time and still be flagged as a misfit task, so going
straight for active balance will be the right thing to do in the majority
of cases.

What we gain from doing that is we save ourselves from (potentially
needlessly) iterating over the rq's tasks. That's less work and less rq
lock contention.

To put a bit of contrast on this, Qais did some profiling of usual mobile
workloads on a regular 4+4 big.LITTLE smartphone, and IIRC the rq depth
rose very rarely above 5, although the tail did reach ~100 tasks. So most
of the time it would be fine to go through the detach_tasks() path.

This all deserves some actual benchmarking, I'll give it a shot.

>>  	if (busiest->nr_running > 1) {
>>  		/*
>>  		 * Attempt to move tasks. If find_busiest_group has found
>> @@ -9074,7 +9089,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>  			goto out_all_pinned;
>>  		}
>>  	}
>> -
>> +active_balance:
>>  	if (!ld_moved) {
>>  		schedstat_inc(sd->lb_failed[idle]);
>>  		/*
>> ----->8-----
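For reference, the ">= 80% of its CPU's capacity" figure above follows from
the ~20% headroom that task_fits_capacity() applies (a capacity_margin of 1280
compared against 1024 in mainline at the time). The sketch below just replays
that threshold arithmetic; toy_fits_capacity() and the capacities 446 and 768
are illustrative stand-ins, not kernel code.

#include <stdio.h>

#define CAPACITY_MARGIN	1280	/* ~20% headroom */

/* Mirrors task_fits_capacity(): fits iff capacity * 1024 > util * margin. */
static int toy_fits_capacity(unsigned long util, unsigned long capacity)
{
	return capacity * 1024 > util * CAPACITY_MARGIN;
}

int main(void)
{
	unsigned long capacities[] = { 446, 768 };	/* illustrative LITTLE/mid CPUs */

	for (unsigned int i = 0; i < sizeof(capacities) / sizeof(capacities[0]); i++) {
		unsigned long cap = capacities[i];
		unsigned long util = 0;

		/* Smallest utilization that no longer fits, i.e. is flagged misfit. */
		while (toy_fits_capacity(util, cap))
			util++;

		printf("capacity %4lu: misfit from util %4lu (%.0f%% of capacity)\n",
		       cap, util, 100.0 * util / cap);
	}
	return 0;
}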
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 53e64a7..d08cc12 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3817,16 +3817,16 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
 		return;
 
 	if (!p) {
-		rq->misfit_task_load = 0;
+		rq->misfit_task_util = 0;
 		return;
 	}
 
 	if (task_fits_capacity(p, capacity_of(cpu_of(rq)))) {
-		rq->misfit_task_load = 0;
+		rq->misfit_task_util = 0;
 		return;
 	}
 
-	rq->misfit_task_load = task_h_load(p);
+	rq->misfit_task_util = task_util_est(p);
 }
 
 #else /* CONFIG_SMP */
@@ -7487,14 +7487,14 @@ static int detach_tasks(struct lb_env *env)
 			break;
 
 		case migrate_misfit:
-			load = task_h_load(p);
+			util = task_util_est(p);
 
 			/*
 			 * utilization of misfit task might decrease a bit
 			 * since it has been recorded. Be conservative in the
 			 * condition.
 			 */
-			if (load < env->imbalance)
+			if (2*util < env->imbalance)
 				goto next;
 
 			env->imbalance = 0;
@@ -7785,7 +7785,7 @@ struct sg_lb_stats {
 	unsigned int group_weight;
 	enum group_type group_type;
 	unsigned int group_asym_capacity; /* tasks should be move to preferred cpu */
-	unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
+	unsigned long group_misfit_task_util; /* A CPU has a task too big for its capacity */
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int nr_numa_running;
 	unsigned int nr_preferred_running;
@@ -7959,7 +7959,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
  */
 static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
 {
-	return rq->misfit_task_load &&
+	return rq->misfit_task_util &&
 		(rq->cpu_capacity_orig < rq->rd->max_cpu_capacity ||
 		 check_cpu_capacity(rq, sd));
 }
@@ -8078,7 +8078,7 @@ group_type group_classify(struct lb_env *env,
 	if (sgs->group_asym_capacity)
 		return group_asym_capacity;
 
-	if (sgs->group_misfit_task_load)
+	if (sgs->group_misfit_task_util)
 		return group_misfit_task;
 
 	if (!group_has_capacity(env, sgs))
@@ -8164,8 +8164,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 		/* Check for a misfit task on the cpu */
 		if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
-		    sgs->group_misfit_task_load < rq->misfit_task_load) {
-			sgs->group_misfit_task_load = rq->misfit_task_load;
+		    sgs->group_misfit_task_util < rq->misfit_task_util) {
+			sgs->group_misfit_task_util = rq->misfit_task_util;
 			*sg_status |= SG_OVERLOAD;
 		}
 	}
@@ -8261,7 +8261,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 		 * If we have more than one misfit sg go with the
 		 * biggest misfit.
 		 */
-		if (sgs->group_misfit_task_load < busiest->group_misfit_task_load)
+		if (sgs->group_misfit_task_util < busiest->group_misfit_task_util)
 			return false;
 		break;
 
@@ -8458,7 +8458,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	if (busiest->group_type == group_misfit_task) {
 		/* Set imbalance to allow misfit task to be balanced. */
 		env->balance_type = migrate_misfit;
-		env->imbalance = busiest->group_misfit_task_load;
+		env->imbalance = busiest->group_misfit_task_util;
 		return;
 	}
 
@@ -8801,8 +8801,8 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 			 * For ASYM_CPUCAPACITY domains with misfit tasks we simply
 			 * seek the "biggest" misfit task.
 			 */
-			if (rq->misfit_task_load > busiest_load) {
-				busiest_load = rq->misfit_task_load;
+			if (rq->misfit_task_util > busiest_util) {
+				busiest_util = rq->misfit_task_util;
 				busiest = rq;
 			}
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7583fad..ef6e1b2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -916,7 +916,7 @@ struct rq {
 
 	unsigned char		idle_balance;
 
-	unsigned long		misfit_task_load;
+	unsigned long		misfit_task_util;
 
 	/* For active balancing */
 	int			active_balance;
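One way to read the new condition in the detach_tasks() hunk above:
env->imbalance holds the utilization that was recorded when the rq was flagged
as misfit, and only skipping tasks whose 2*util is below it lets the misfit
task's utilization decay to half the recorded value and still be picked, while
genuinely small tasks queued on the same rq are passed over. A small
standalone sketch; toy_still_considered_misfit() and the numbers are purely
illustrative.

#include <stdio.h>

/* Mirrors the patch: a task is skipped when 2 * util < env->imbalance. */
static int toy_still_considered_misfit(unsigned long util, unsigned long imbalance)
{
	return 2 * util >= imbalance;
}

int main(void)
{
	unsigned long recorded = 800;	/* group_misfit_task_util -> env->imbalance */

	/* The misfit task's utilization decayed a bit since it was recorded. */
	printf("util 700 vs recorded 800: %s\n",
	       toy_still_considered_misfit(700, recorded) ? "detach" : "skip");
	/* A genuinely small task queued on the same rq. */
	printf("util 150 vs recorded 800: %s\n",
	       toy_still_considered_misfit(150, recorded) ? "detach" : "skip");
	return 0;
}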
Utilization is used to detect a misfit task, but the load is then used to
select the task on the CPU, which can lead to selecting a small task with
a high weight instead of the task that triggered the misfit migration.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c  | 28 ++++++++++++++--------------
 kernel/sched/sched.h |  2 +-
 2 files changed, 15 insertions(+), 15 deletions(-)

-- 
2.7.4
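To make the changelog's scenario concrete, here is a deliberately crude
standalone illustration (not PELT): a small task boosted to nice -20 carries
the kernel's 88761 weight entry, so its load can dwarf that of the nice-0 busy
loop that actually triggered the misfit migration, while its utilization stays
small. The running shares and the "load ~ running share * weight"
approximation are assumptions made purely for illustration.

#include <stdio.h>

struct toy_task {
	const char *name;
	double running_share;	/* fraction of time spent on the CPU */
	unsigned long weight;	/* sched_prio_to_weight[] entry */
};

int main(void)
{
	struct toy_task busy_loop   = { "misfit busy loop (nice 0)", 0.95, 1024 };
	struct toy_task small_niced = { "small task (nice -20)",     0.10, 88761 };
	struct toy_task *tasks[] = { &busy_loop, &small_niced };

	for (unsigned int i = 0; i < 2; i++) {
		/* Crude stand-ins for task_h_load() and task_util_est(). */
		double load = tasks[i]->running_share * tasks[i]->weight;
		double util = tasks[i]->running_share * 1024;

		printf("%-27s load ~ %7.0f   util ~ %4.0f\n",
		       tasks[i]->name, load, util);
	}
	return 0;
}

Picking the busiest candidate by load selects the small nice -20 task, while
picking by utilization selects the busy loop the misfit machinery was actually
complaining about, which is the inversion the patch addresses.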