Message ID | 1536306664-29827-1-git-send-email-vincent.guittot@linaro.org
---|---
State | New
Series | sched/fair: fix load_balance redo for null imbalance
On Fri, Sep 07, 2018 at 09:51:04AM +0200, Vincent Guittot wrote:
> It can happen that load_balance finds a busiest group and then a busiest rq
> but the calculated imbalance is in fact null.

Cute. Does that happen often?

> If the calculated imbalance is null, it's useless to try to find a busiest
> rq as no task will be migrated and we can return immediately.
>
> This situation can happen with heterogeneous system or smp system when RT
> tasks are decreasing the capacity of some CPUs.

Is it the result of one of those "force_balance" conditions in
find_busiest_group()? Should we not fix that to then return NULL instead?
On Friday 07 Sep 2018 at 13:37:49 (+0200), Peter Zijlstra wrote:
> On Fri, Sep 07, 2018 at 09:51:04AM +0200, Vincent Guittot wrote:
> > It can happen that load_balance finds a busiest group and then a busiest rq
> > but the calculated imbalance is in fact null.
>
> Cute. Does that happen often?

I have a use case with RT tasks that reproduces the problem regularly.
It happens at least when we have CPUs with different capacity, either
because of heterogeneous CPUs or because RT/DL is reducing the capacity
available to cfs. I have put the call path that triggers the problem below,
and according to the comment it seems that we can reach a similar state
when playing with priority.

> > If the calculated imbalance is null, it's useless to try to find a busiest
> > rq as no task will be migrated and we can return immediately.
> >
> > This situation can happen with heterogeneous system or smp system when RT
> > tasks are decreasing the capacity of some CPUs.
>
> Is it the result of one of those "force_balance" conditions in
> find_busiest_group()? Should we not fix that to then return NULL instead?

The use case is:
We have a newly_idle load balance that is triggered when an RT task becomes
idle (but I think I have seen this with the idle load balance too).

We hit:

	if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
	    busiest->group_no_capacity)
		goto force_balance;

In calculate_imbalance() we take the path:

	/*
	 * Avg load of busiest sg can be less and avg load of local sg can
	 * be greater than avg load across all sgs of sd because avg load
	 * factors in sg capacity and sgs with smaller group_type are
	 * skipped when updating the busiest sg:
	 */
	if (busiest->avg_load <= sds->avg_load ||
	    local->avg_load >= sds->avg_load) {
		env->imbalance = 0;
		return fix_small_imbalance(env, sds);
	}

but fix_small_imbalance() finally decides to return without modifying the
imbalance, as here:

	if (busiest->avg_load + scaled_busy_load_per_task >=
	    local->avg_load + (scaled_busy_load_per_task * imbn)) {
		env->imbalance = busiest->load_per_task;
		return;
	}

Besides this patch, I'm preparing another patch for fix_small_imbalance()
to ensure 1 task per CPU in a similar situation, but according to the
comment above, we can also reach this situation because of task priorities.
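To make the capacity effect concrete, here is a toy userspace model (the
numbers are invented for illustration, not taken from a trace, and this is
not kernel code): avg_load divides group load by group capacity, so a local
group whose capacity is eaten by RT work inflates the sd-wide average, and
the group picked as "busiest" can land at or below that average, which is
exactly the env->imbalance = 0 path quoted above.

	#include <stdio.h>

	#define SCHED_CAPACITY_SCALE 1024

	/* avg_load as computed per sched_group: load scaled by capacity. */
	static unsigned long avg_load(unsigned long load, unsigned long cap)
	{
		return load * SCHED_CAPACITY_SCALE / cap;
	}

	int main(void)
	{
		/* Invented numbers for illustration only. */
		unsigned long busiest = avg_load(600, 1024); /* big CPU, moderate cfs load */
		unsigned long local   = avg_load(300, 256);  /* capacity eaten by RT work  */
		unsigned long domain  = avg_load(900, 1280); /* sd-wide average            */

		printf("busiest=%lu local=%lu domain=%lu\n", busiest, local, domain);
		if (busiest <= domain || local >= domain)
			printf("-> calculate_imbalance() would set env->imbalance = 0\n");
		return 0;
	}

Running this prints busiest=600, local=1200, domain=720, so the condition
fires even though a "busiest" group was selected.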
On Fri, 7 Sep 2018 at 14:35, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>
> On Friday 07 Sep 2018 at 13:37:49 (+0200), Peter Zijlstra wrote:
> > On Fri, Sep 07, 2018 at 09:51:04AM +0200, Vincent Guittot wrote:
> > > It can happen that load_balance finds a busiest group and then a busiest rq
> > > but the calculated imbalance is in fact null.
> >
> > Cute. Does that happen often?
>
> [...]
>
> Besides this patch, I'm preparing another patch for fix_small_imbalance()
> to ensure 1 task per CPU in a similar situation, but according to the
> comment above, we can also reach this situation because of task priorities.

I have just done a quick test on my smp hikey board (dual quad core arm64)
by adding a log in dmesg each time we hit the condition busiest != NULL and
imbalance == 0. The log appears from time to time when I generate some
activity on the board, like syncing the filesystem before running a test,
but I don't have the details. The logs happen with and without the next
patch that I mentioned above, so it probably means that we can trigger
this situation with other use cases.
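The dmesg instrumentation described above is not shown in the thread; a
hypothetical sketch of such a check (placement and message are assumptions,
not Vincent's actual debug patch) inside load_balance() could be:

	group = find_busiest_group(&env);
	if (group && !env.imbalance)
		printk_deferred("load_balance: busiest group but imbalance == 0 (cpu %d)\n",
				this_cpu);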
On Fri, Sep 07, 2018 at 02:35:51PM +0200, Vincent Guittot wrote:
> On Friday 07 Sep 2018 at 13:37:49 (+0200), Peter Zijlstra wrote:
> > [...]
> > Is it the result of one of those "force_balance" conditions in
> > find_busiest_group()? Should we not fix that to then return NULL instead?
>
> [...]
>
> but fix_small_imbalance() finally decides to return without modifying the
> imbalance, as here:
>
> 	if (busiest->avg_load + scaled_busy_load_per_task >=
> 	    local->avg_load + (scaled_busy_load_per_task * imbn)) {
> 		env->imbalance = busiest->load_per_task;
> 		return;
> 	}

That one actually does modify imbalance :-) But I get your point.

> Besides this patch, I'm preparing another patch for fix_small_imbalance()
> to ensure 1 task per CPU in a similar situation, but according to the
> comment above, we can also reach this situation because of task priorities.

Didn't we all hate fix_small_imbalance() ?

Anyway, I think I'd prefer something like the below; although it might be
nicer to thread the return value through calculate_imbalance() and
fix_small_imbalance(), but looking at them that's not going to be
particularly nicer.

Do you agree with this? If so, I'll stick your original Changelog on it and
pretend this is what you sent me :-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b39fb596f6c1..0596a29f3d2a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8269,7 +8269,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 force_balance:
 	/* Looks like there is an imbalance. Compute it */
 	calculate_imbalance(env, &sds);
-	return sds.busiest;
+	return env->imbalance ? sds.busiest : NULL;
 
 out_balanced:
 	env->imbalance = 0;
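For reference, the caller side that consumes this return value looks like
the following in the v4.18-era load_balance() (trimmed; the same lines
appear as context in the patch hunk elsewhere on this page). With
find_busiest_group() returning NULL on a zero imbalance, this existing
check already bails out and no extra !env.imbalance test is needed:

	group = find_busiest_group(&env);
	if (!group) {
		schedstat_inc(sd->lb_nobusyg[idle]);
		goto out_balanced;
	}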
On Fri, Sep 07, 2018 at 02:55:54PM +0200, Peter Zijlstra wrote:
> > Besides this patch, I'm preparing another patch for fix_small_imbalance()
> > to ensure 1 task per CPU in a similar situation, but according to the
> > comment above, we can also reach this situation because of task priorities.
>
> Didn't we all hate fix_small_imbalance() ?

That is, all that load_per_task business is complete rubbish in this
cgroup world.
On Fri, 7 Sep 2018 at 14:56, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Sep 07, 2018 at 02:35:51PM +0200, Vincent Guittot wrote:
> > [...]
> > Besides this patch, I'm preparing another patch for fix_small_imbalance()
> > to ensure 1 task per CPU in a similar situation, but according to the
> > comment above, we can also reach this situation because of task priorities.
>
> Didn't we all hate fix_small_imbalance() ?

Yes. The rationale for some decisions in that function is more and more
opaque to me. For now I'm trying to fix all the use cases that I can find,
to then try to derive a generic rule.

As an example, I have some situations (other than those discussed here) in
fix_small_imbalance() where the task migration decision mainly depends on a
variation of +/-5 in the load of a task. This finally generates good
fairness between tasks, but the root cause of this fairness is a bit random.

> Anyway, I think I'd prefer something like the below; although it might be
> nicer to thread the return value through calculate_imbalance() and
> fix_small_imbalance(), but looking at them that's not going to be
> particularly nicer.
>
> Do you agree with this? If so, I'll stick your original Changelog on it and
> pretend this is what you sent me :-)

That's fine to me :-)

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b39fb596f6c1..0596a29f3d2a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8269,7 +8269,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>  force_balance:
>  	/* Looks like there is an imbalance. Compute it */
>  	calculate_imbalance(env, &sds);
> -	return sds.busiest;
> +	return env->imbalance ? sds.busiest : NULL;
>
>  out_balanced:
>  	env->imbalance = 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 309c93f..224bfae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8464,7 +8464,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 	}
 
 	group = find_busiest_group(&env);
-	if (!group) {
+	if (!group || !env.imbalance) {
 		schedstat_inc(sd->lb_nobusyg[idle]);
 		goto out_balanced;
 	}
It can happen that load_balance finds a busiest group and then a busiest
rq, but the calculated imbalance is in fact null. In such a situation,
detach_tasks returns immediately and leaves the flag LBF_ALL_PINNED set.
The busiest CPU is then wrongly assumed to have pinned tasks and is removed
from the load balance mask. Then, we redo a load balance without the
busiest CPU. This creates a wrong load balance situation and generates
wrong task migrations.

If the calculated imbalance is null, it's useless to try to find a busiest
rq, as no task will be migrated and we can return immediately.

This situation can happen with a heterogeneous system or an SMP system
when RT tasks are decreasing the capacity of some CPUs.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-- 
2.7.4
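For reference, the redo path the changelog refers to looks roughly like
this (a trimmed and lightly paraphrased sketch of the v4.18-era
load_balance() in kernel/sched/fair.c, not part of this patch). With
imbalance == 0, detach_tasks() returns before examining any task, so
LBF_ALL_PINNED, which is set up front and only cleared once a migratable
task is found, survives, and the busiest CPU is dropped:

	if (unlikely(env.flags & LBF_ALL_PINNED)) {
		/*
		 * All tasks on the busiest rq appeared pinned: drop that
		 * CPU from the candidate mask and retry the balance
		 * without it -- wrongly so when the imbalance was simply 0.
		 */
		cpumask_clear_cpu(cpu_of(busiest), cpus);
		if (!cpumask_subset(cpus, env.dst_grpmask)) {
			env.loop = 0;
			env.loop_break = sched_nr_migrate_break;
			goto redo;
		}
		goto out_all_pinned;
	}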