Message ID | 1564670424-26023-9-git-send-email-vincent.guittot@linaro.org
State      | New
Series     | sched/fair: rework the CFS load balance
On 01/08/2019 15:40, Vincent Guittot wrote:
> @@ -8261,7 +8261,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>  			 * If we have more than one misfit sg go with the
>  			 * biggest misfit.
>  			 */
> -			if (sgs->group_misfit_task_load < busiest->group_misfit_task_load)
> +			if (sgs->group_misfit_task_util < busiest->group_misfit_task_util)
>  				return false;

I previously said this change would render the maximization useless, but I
had forgotten one thing: with PELT time scaling, task utilization can go
above its CPU's capacity.

So if you have two LITTLE CPUs running a busy loop (misfit task) each, the
one that's been running the longest would have the highest utilization
(assuming they haven't reached 1024 yet). In other words, "utilizations
above the capacity_margin can't be compared" doesn't really stand.

Still, maximizing load would lead us there. Furthermore, if we have to pick
between two rqs with misfit tasks, I still believe we should pick the one
with the highest load, not the highest utilization.

We could keep load and fix the issue of detaching the wrong task with
something like:

-----8<-----
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 53e64a7b2ae0..bfc2b624ee98 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7489,12 +7489,8 @@ static int detach_tasks(struct lb_env *env)
 		case migrate_misfit:
 			load = task_h_load(p);
 
-			/*
-			 * utilization of misfit task might decrease a bit
-			 * since it has been recorded. Be conservative in the
-			 * condition.
-			 */
-			if (load < env->imbalance)
+			/* This is not a misfit task */
+			if (task_fits_capacity(p, capacity_of(env->src_cpu)))
 				goto next;
 
 			env->imbalance = 0;
----->8-----

However what would be *even* better IMO would be:

-----8<-----
@@ -8853,6 +8853,7 @@ voluntary_active_balance(struct lb_env *env)
 			return 1;
 	}
 
+	/* XXX: make sure current is still a misfit task? */
 	if (env->balance_type == migrate_misfit)
 		return 1;
 
@@ -8966,6 +8967,20 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 	env.src_rq = busiest;
 
 	ld_moved = 0;
+
+	/*
+	 * Misfit tasks are only misfit if they are currently running, see
+	 * update_misfit_status().
+	 *
+	 * - If they're not running, we'll get an opportunity at wakeup to
+	 *   migrate them to a bigger CPU.
+	 * - If they're running, we need to active balance them to a bigger CPU.
+	 *
+	 * Do the smart thing and go straight for active balance.
+	 */
+	if (env->balance_type == migrate_misfit)
+		goto active_balance;
+
 	if (busiest->nr_running > 1) {
 		/*
 		 * Attempt to move tasks. If find_busiest_group has found
@@ -9074,7 +9089,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 			goto out_all_pinned;
 		}
 	}
-
+active_balance:
 	if (!ld_moved) {
 		schedstat_inc(sd->lb_failed[idle]);
 		/*
----->8-----
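As an aside for readers following the PELT argument above: a rough, standalone
approximation (not the kernel's PELT code) of how a permanently running task's
utilization keeps climbing towards 1024, i.e. above a LITTLE CPU's capacity, so
two misfit busy loops that started at different times still have distinct,
comparable utilizations. pelt_util() only models the usual ~32 ms half-life
geometric decay and ignores the capacity/frequency scaling of the time deltas
(which would merely slow the convergence); the LITTLE capacity of 446 is an
illustrative value.

/* Build with: gcc -O2 pelt_toy.c -lm (file name is arbitrary) */
#include <math.h>
#include <stdio.h>

/* Utilization of a task that has been running non-stop for runtime_ms. */
static double pelt_util(double runtime_ms)
{
	const double y = pow(0.5, 1.0 / 32.0);	/* decay factor: y^32 = 0.5 */

	return 1024.0 * (1.0 - pow(y, runtime_ms));
}

int main(void)
{
	const double little_capacity = 446.0;	/* illustrative LITTLE CPU */

	printf("busy loop after 100ms: util ~ %4.0f (CPU capacity %.0f)\n",
	       pelt_util(100), little_capacity);
	printf("busy loop after 300ms: util ~ %4.0f (CPU capacity %.0f)\n",
	       pelt_util(300), little_capacity);
	return 0;
}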
On Thu, 1 Aug 2019 at 18:27, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> On 01/08/2019 15:40, Vincent Guittot wrote:
> > @@ -8261,7 +8261,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> >  			 * If we have more than one misfit sg go with the
> >  			 * biggest misfit.
> >  			 */
> > -			if (sgs->group_misfit_task_load < busiest->group_misfit_task_load)
> > +			if (sgs->group_misfit_task_util < busiest->group_misfit_task_util)
> >  				return false;
>
> I previously said this change would render the maximization useless, but I
> had forgotten one thing: with PELT time scaling, task utilization can go
> above its CPU's capacity.
>
> So if you have two LITTLE CPUs running a busy loop (misfit task) each, the
> one that's been running the longest would have the highest utilization
> (assuming they haven't reached 1024 yet). In other words, "utilizations
> above the capacity_margin can't be compared" doesn't really stand.
>
> Still, maximizing load would lead us there. Furthermore, if we have to pick
> between two rqs with misfit tasks, I still believe we should pick the one
> with the highest load, not the highest utilization.
>
> We could keep load and fix the issue of detaching the wrong task with
> something like:
>
> -----8<-----
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 53e64a7b2ae0..bfc2b624ee98 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7489,12 +7489,8 @@ static int detach_tasks(struct lb_env *env)
>  		case migrate_misfit:
>  			load = task_h_load(p);
>
> -			/*
> -			 * utilization of misfit task might decrease a bit
> -			 * since it has been recorded. Be conservative in the
> -			 * condition.
> -			 */
> -			if (load < env->imbalance)
> +			/* This is not a misfit task */
> +			if (task_fits_capacity(p, capacity_of(env->src_cpu)))
>  				goto next;

This could be a solution to make sure we pull only the misfit task while
keeping the use of load.

>
>  			env->imbalance = 0;
> ----->8-----
>
> However what would be *even* better IMO would be:
>
> -----8<-----
> @@ -8853,6 +8853,7 @@ voluntary_active_balance(struct lb_env *env)
>  			return 1;
>  	}
>
> +	/* XXX: make sure current is still a misfit task? */
>  	if (env->balance_type == migrate_misfit)
>  		return 1;
>
> @@ -8966,6 +8967,20 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>  	env.src_rq = busiest;
>
>  	ld_moved = 0;
> +
> +	/*
> +	 * Misfit tasks are only misfit if they are currently running, see
> +	 * update_misfit_status().
> +	 *
> +	 * - If they're not running, we'll get an opportunity at wakeup to
> +	 *   migrate them to a bigger CPU.
> +	 * - If they're running, we need to active balance them to a bigger CPU.
> +	 *
> +	 * Do the smart thing and go straight for active balance.
> +	 */
> +	if (env->balance_type == migrate_misfit)
> +		goto active_balance;
> +

This looks ugly and adds a new bypass, which this patchset tries to remove.
It also doesn't work if the misfit task has been preempted by another one
during the load balance and is waiting on the runqueue.

>  	if (busiest->nr_running > 1) {
>  		/*
>  		 * Attempt to move tasks. If find_busiest_group has found
> @@ -9074,7 +9089,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>  			goto out_all_pinned;
>  		}
>  	}
> -
> +active_balance:
>  	if (!ld_moved) {
>  		schedstat_inc(sd->lb_failed[idle]);
>  		/*
> ----->8-----
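To make the preemption concern concrete, here is a minimal toy sketch (not
kernel code) of the semantics that update_misfit_status() implements in the
patch further down: the rq-level misfit state mirrors whichever task is
running when the status is refreshed, so a misfit task that gets preempted
stops being visible through it even though it is still queued. struct toy_rq,
toy_update_misfit_status() and all numbers are made up for illustration.

#include <stdbool.h>
#include <stdio.h>

struct toy_rq {
	unsigned long misfit_task_util;	/* stands in for rq->misfit_task_util */
};

/* Called with whatever task is currently running, like the scheduler tick. */
static void toy_update_misfit_status(struct toy_rq *rq, unsigned long curr_util,
				     bool curr_fits_cpu)
{
	rq->misfit_task_util = curr_fits_cpu ? 0 : curr_util;
}

int main(void)
{
	struct toy_rq rq = { 0 };

	/* A busy loop too big for its LITTLE CPU is running. */
	toy_update_misfit_status(&rq, 900, false);
	printf("misfit_task_util while the big task runs: %lu\n", rq.misfit_task_util);

	/*
	 * A small task preempts it; the big task is still queued, but the
	 * rq no longer advertises a misfit task.
	 */
	toy_update_misfit_status(&rq, 150, true);
	printf("misfit_task_util after preemption:        %lu\n", rq.misfit_task_util);
	return 0;
}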
On 02/08/2019 09:29, Vincent Guittot wrote:
>> However what would be *even* better IMO would be:
>>
>> -----8<-----
>> @@ -8853,6 +8853,7 @@ voluntary_active_balance(struct lb_env *env)
>>  			return 1;
>>  	}
>>
>> +	/* XXX: make sure current is still a misfit task? */
>>  	if (env->balance_type == migrate_misfit)
>>  		return 1;
>>
>> @@ -8966,6 +8967,20 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>  	env.src_rq = busiest;
>>
>>  	ld_moved = 0;
>> +
>> +	/*
>> +	 * Misfit tasks are only misfit if they are currently running, see
>> +	 * update_misfit_status().
>> +	 *
>> +	 * - If they're not running, we'll get an opportunity at wakeup to
>> +	 *   migrate them to a bigger CPU.
>> +	 * - If they're running, we need to active balance them to a bigger CPU.
>> +	 *
>> +	 * Do the smart thing and go straight for active balance.
>> +	 */
>> +	if (env->balance_type == migrate_misfit)
>> +		goto active_balance;
>> +
>
> This looks ugly and adds a new bypass, which this patchset tries to remove.
> It also doesn't work if the misfit task has been preempted by another one
> during the load balance and is waiting on the runqueue.

I won't comment on aesthetics, but when it comes to preempted misfit tasks,
do consider that a task *has* to have a utilization >= 80% of its CPU's
capacity to be flagged as misfit. If it's a busy loop, it can only be
preempted ~20% of the time and still be flagged as a misfit task, so going
straight for active balance will be the right thing to do in the majority
of cases.

What we gain from doing that is we save ourselves from (potentially
needlessly) iterating over the rq's tasks. That's less work and less rq
lock contention.

To put a bit of contrast on this, Qais did some profiling of usual mobile
workloads on a regular 4+4 big.LITTLE smartphone, and IIRC the rq depth
rose very rarely above 5, although the tail did reach ~100 tasks. So most
of the time it would be fine to go through the detach_tasks() path.

This all deserves some actual benchmarking, I'll give it a shot.

>>  	if (busiest->nr_running > 1) {
>>  		/*
>>  		 * Attempt to move tasks. If find_busiest_group has found
>> @@ -9074,7 +9089,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>  			goto out_all_pinned;
>>  		}
>>  	}
>> -
>> +active_balance:
>>  	if (!ld_moved) {
>>  		schedstat_inc(sd->lb_failed[idle]);
>>  		/*
>> ----->8-----
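For reference, the ">= 80% of its CPU's capacity" figure above follows from
the ~20% headroom that task_fits_capacity() applies (a capacity_margin of 1280
compared against 1024 in mainline at the time). The sketch below just replays
that threshold arithmetic; toy_fits_capacity() and the capacities 446 and 768
are illustrative stand-ins, not kernel code.

#include <stdio.h>

#define CAPACITY_MARGIN	1280	/* ~20% headroom */

/* Mirrors task_fits_capacity(): fits iff capacity * 1024 > util * margin. */
static int toy_fits_capacity(unsigned long util, unsigned long capacity)
{
	return capacity * 1024 > util * CAPACITY_MARGIN;
}

int main(void)
{
	unsigned long capacities[] = { 446, 768 };	/* illustrative LITTLE/mid CPUs */

	for (unsigned int i = 0; i < sizeof(capacities) / sizeof(capacities[0]); i++) {
		unsigned long cap = capacities[i];
		unsigned long util = 0;

		/* Smallest utilization that no longer fits, i.e. is flagged misfit. */
		while (toy_fits_capacity(util, cap))
			util++;

		printf("capacity %4lu: misfit from util %4lu (%.0f%% of capacity)\n",
		       cap, util, 100.0 * util / cap);
	}
	return 0;
}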
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 53e64a7..d08cc12 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3817,16 +3817,16 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
 		return;
 
 	if (!p) {
-		rq->misfit_task_load = 0;
+		rq->misfit_task_util = 0;
 		return;
 	}
 
 	if (task_fits_capacity(p, capacity_of(cpu_of(rq)))) {
-		rq->misfit_task_load = 0;
+		rq->misfit_task_util = 0;
 		return;
 	}
 
-	rq->misfit_task_load = task_h_load(p);
+	rq->misfit_task_util = task_util_est(p);
 }
 
 #else /* CONFIG_SMP */
@@ -7487,14 +7487,14 @@ static int detach_tasks(struct lb_env *env)
 			break;
 
 		case migrate_misfit:
-			load = task_h_load(p);
+			util = task_util_est(p);
 
 			/*
 			 * utilization of misfit task might decrease a bit
 			 * since it has been recorded. Be conservative in the
 			 * condition.
 			 */
-			if (load < env->imbalance)
+			if (2*util < env->imbalance)
 				goto next;
 
 			env->imbalance = 0;
@@ -7785,7 +7785,7 @@ struct sg_lb_stats {
 	unsigned int group_weight;
 	enum group_type group_type;
 	unsigned int group_asym_capacity; /* tasks should be move to preferred cpu */
-	unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
+	unsigned long group_misfit_task_util; /* A CPU has a task too big for its capacity */
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int nr_numa_running;
 	unsigned int nr_preferred_running;
@@ -7959,7 +7959,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
  */
 static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
 {
-	return rq->misfit_task_load &&
+	return rq->misfit_task_util &&
 		(rq->cpu_capacity_orig < rq->rd->max_cpu_capacity ||
 		 check_cpu_capacity(rq, sd));
 }
@@ -8078,7 +8078,7 @@ group_type group_classify(struct lb_env *env,
 	if (sgs->group_asym_capacity)
 		return group_asym_capacity;
 
-	if (sgs->group_misfit_task_load)
+	if (sgs->group_misfit_task_util)
 		return group_misfit_task;
 
 	if (!group_has_capacity(env, sgs))
@@ -8164,8 +8164,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 		/* Check for a misfit task on the cpu */
 		if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
-		    sgs->group_misfit_task_load < rq->misfit_task_load) {
-			sgs->group_misfit_task_load = rq->misfit_task_load;
+		    sgs->group_misfit_task_util < rq->misfit_task_util) {
+			sgs->group_misfit_task_util = rq->misfit_task_util;
 			*sg_status |= SG_OVERLOAD;
 		}
 	}
@@ -8261,7 +8261,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 		 * If we have more than one misfit sg go with the
 		 * biggest misfit.
 		 */
-		if (sgs->group_misfit_task_load < busiest->group_misfit_task_load)
+		if (sgs->group_misfit_task_util < busiest->group_misfit_task_util)
 			return false;
 		break;
 
@@ -8458,7 +8458,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	if (busiest->group_type == group_misfit_task) {
 		/* Set imbalance to allow misfit task to be balanced. */
 		env->balance_type = migrate_misfit;
-		env->imbalance = busiest->group_misfit_task_load;
+		env->imbalance = busiest->group_misfit_task_util;
 		return;
 	}
 
@@ -8801,8 +8801,8 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 			 * For ASYM_CPUCAPACITY domains with misfit tasks we simply
 			 * seek the "biggest" misfit task.
 			 */
-			if (rq->misfit_task_load > busiest_load) {
-				busiest_load = rq->misfit_task_load;
+			if (rq->misfit_task_util > busiest_util) {
+				busiest_util = rq->misfit_task_util;
 				busiest = rq;
 			}
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7583fad..ef6e1b2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -916,7 +916,7 @@ struct rq {
 
 	unsigned char		idle_balance;
 
-	unsigned long		misfit_task_load;
+	unsigned long		misfit_task_util;
 
 	/* For active balancing */
 	int			active_balance;
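One way to read the new condition in the detach_tasks() hunk above:
env->imbalance holds the utilization that was recorded when the rq was flagged
as misfit, and only skipping tasks whose 2*util is below it lets the misfit
task's utilization decay to half the recorded value and still be picked, while
genuinely small tasks queued on the same rq are passed over. A small
standalone sketch; toy_still_considered_misfit() and the numbers are purely
illustrative.

#include <stdio.h>

/* Mirrors the patch: a task is skipped when 2 * util < env->imbalance. */
static int toy_still_considered_misfit(unsigned long util, unsigned long imbalance)
{
	return 2 * util >= imbalance;
}

int main(void)
{
	unsigned long recorded = 800;	/* group_misfit_task_util -> env->imbalance */

	/* The misfit task's utilization decayed a bit since it was recorded. */
	printf("util 700 vs recorded 800: %s\n",
	       toy_still_considered_misfit(700, recorded) ? "detach" : "skip");
	/* A genuinely small task queued on the same rq. */
	printf("util 150 vs recorded 800: %s\n",
	       toy_still_considered_misfit(150, recorded) ? "detach" : "skip");
	return 0;
}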
Utilization is used to detect a misfit task, but the load is then used to
select the task on the CPU, which can lead to selecting a small task with
a high weight instead of the task that triggered the misfit migration.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c  | 28 ++++++++++++++--------------
 kernel/sched/sched.h |  2 +-
 2 files changed, 15 insertions(+), 15 deletions(-)

-- 
2.7.4
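To make the changelog's scenario concrete, here is a deliberately crude
standalone illustration (not PELT): a small task boosted to nice -20 carries
the kernel's 88761 weight entry, so its load can dwarf that of the nice-0 busy
loop that actually triggered the misfit migration, while its utilization stays
small. The running shares and the "load ~ running share * weight"
approximation are assumptions made purely for illustration.

#include <stdio.h>

struct toy_task {
	const char *name;
	double running_share;	/* fraction of time spent on the CPU */
	unsigned long weight;	/* sched_prio_to_weight[] entry */
};

int main(void)
{
	struct toy_task busy_loop   = { "misfit busy loop (nice 0)", 0.95, 1024 };
	struct toy_task small_niced = { "small task (nice -20)",     0.10, 88761 };
	struct toy_task *tasks[] = { &busy_loop, &small_niced };

	for (unsigned int i = 0; i < 2; i++) {
		/* Crude stand-ins for task_h_load() and task_util_est(). */
		double load = tasks[i]->running_share * tasks[i]->weight;
		double util = tasks[i]->running_share * 1024;

		printf("%-27s load ~ %7.0f   util ~ %4.0f\n",
		       tasks[i]->name, load, util);
	}
	return 0;
}

Picking the busiest candidate by load selects the small nice -20 task, while
picking by utilization selects the busy loop the misfit machinery was actually
complaining about, which is the inversion the patch addresses.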