From patchwork Fri Nov 25 15:34:33 2016
X-Patchwork-Submitter: Vincent Guittot
X-Patchwork-Id: 84162
From: Vincent Guittot <vincent.guittot@linaro.org>
To: peterz@infradead.org, mingo@kernel.org, linux-kernel@vger.kernel.org,
	matt@codeblueprint.co.uk, Morten.Rasmussen@arm.com,
	dietmar.eggemann@arm.com
Cc: kernellwp@gmail.com, yuyang.du@intel.com, umgwanakikbuti@gmail.com,
	Vincent Guittot <vincent.guittot@linaro.org>
Subject: [PATCH 2/2 v2] sched: use load_avg for selecting idlest group
Date: Fri, 25 Nov 2016 16:34:33 +0100
Message-Id: <1480088073-11642-3-git-send-email-vincent.guittot@linaro.org>
In-Reply-To: <1480088073-11642-1-git-send-email-vincent.guittot@linaro.org>
References: <1480088073-11642-1-git-send-email-vincent.guittot@linaro.org>

find_idlest_group() only compares the runnable_load_avg when looking for
the least loaded group. But on fork-intensive use cases like hackbench,
where tasks block quickly after the fork, this can lead to selecting the
same CPU over and over instead of other CPUs that have a similar
runnable load but a lower load_avg.

When the runnable_load_avg values of 2 CPUs are close, we now take into
account the amount of blocked load as a 2nd selection factor. There are
now 3 zones for the runnable_load of the rq:

- [0 .. (runnable_load - imbalance)]:
	Select the new rq which has significantly less runnable_load.
- ](runnable_load - imbalance) .. (runnable_load + imbalance)[:
	The runnable loads are close, so we use load_avg to choose
	between the 2 rqs.
- [(runnable_load + imbalance) .. ULONG_MAX]:
	Keep the current rq which has significantly less runnable_load.

For use cases like hackbench, this enables the scheduler to select
different CPUs during the fork sequence and to spread tasks across the
system.

Tests have been done on a Hikey board (ARM-based octo-core) for several
kernels. The results below give the min, max, avg and stdev values of
18 runs with each configuration.

The v4.8+patches configuration also includes the change below, which is
part of the proposal made by Peter to ensure that the clock is up to
date when the forked task is attached to the rq:

@@ -2568,6 +2568,7 @@ void wake_up_new_task(struct task_struct *p)
 	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
 #endif
 	rq = __task_rq_lock(p, &rf);
+	update_rq_clock(rq);
 	post_init_entity_util_avg(&p->se);
 	activate_task(rq, p, 0);

hackbench -P -g 1

        ea86cb4b7621  7dc603c9028e  v4.8       v4.8+patches
min     0.049         0.050         0.051      0.048
avg     0.057         0.057(0%)     0.057(0%)  0.055(+5%)
max     0.066         0.068         0.070      0.063
stdev   +/-9%         +/-9%         +/-8%      +/-9%

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
Changes since v2:
- Rebase on latest sched/core
- Get the same results with the rebase and the fix mentioned in patch 01

 kernel/sched/fair.c | 48 ++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 38 insertions(+), 10 deletions(-)

-- 
2.7.4

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 820a787..ecb5ee8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5395,16 +5395,20 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 {
 	struct sched_group *idlest = NULL, *group = sd->groups;
 	struct sched_group *most_spare_sg = NULL;
-	unsigned long min_load = ULONG_MAX, this_load = 0;
+	unsigned long min_runnable_load = ULONG_MAX, this_runnable_load = 0;
+	unsigned long min_avg_load = ULONG_MAX, this_avg_load = 0;
 	unsigned long most_spare = 0, this_spare = 0;
 	int load_idx = sd->forkexec_idx;
-	int imbalance = 100 + (sd->imbalance_pct-100)/2;
+	int imbalance_scale = 100 + (sd->imbalance_pct-100)/2;
+	unsigned long imbalance = scale_load_down(NICE_0_LOAD) *
+				(sd->imbalance_pct-100) / 100;

 	if (sd_flag & SD_BALANCE_WAKE)
 		load_idx = sd->wake_idx;

 	do {
-		unsigned long load, avg_load, spare_cap, max_spare_cap;
+		unsigned long load, avg_load, runnable_load;
+		unsigned long spare_cap, max_spare_cap;
 		int local_group;
 		int i;

@@ -5421,6 +5425,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		 * the group containing the CPU with most spare capacity.
 		 */
 		avg_load = 0;
+		runnable_load = 0;
 		max_spare_cap = 0;

 		for_each_cpu(i, sched_group_cpus(group)) {
@@ -5430,7 +5435,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 			else
 				load = target_load(i, load_idx);

-			avg_load += load;
+			runnable_load += load;
+
+			avg_load += cfs_rq_load_avg(&cpu_rq(i)->cfs);

 			spare_cap = capacity_spare_wake(i, p);

@@ -5439,14 +5446,32 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		}

 		/* Adjust by relative CPU capacity of the group */
-		avg_load = (avg_load * SCHED_CAPACITY_SCALE) / group->sgc->capacity;
+		avg_load = (avg_load * SCHED_CAPACITY_SCALE) /
+					group->sgc->capacity;
+		runnable_load = (runnable_load * SCHED_CAPACITY_SCALE) /
+					group->sgc->capacity;

 		if (local_group) {
-			this_load = avg_load;
+			this_runnable_load = runnable_load;
+			this_avg_load = avg_load;
 			this_spare = max_spare_cap;
 		} else {
-			if (avg_load < min_load) {
-				min_load = avg_load;
+			if (min_runnable_load > (runnable_load + imbalance)) {
+				/*
+				 * The runnable load is significantly smaller
+				 * so we can pick this new cpu
+				 */
+				min_runnable_load = runnable_load;
+				min_avg_load = avg_load;
+				idlest = group;
+			} else if ((runnable_load < (min_runnable_load + imbalance)) &&
+					(100*min_avg_load > imbalance_scale*avg_load)) {
+				/*
+				 * The runnable loads are close so we take
+				 * into account blocked load through avg_load
+				 * which is blocked + runnable load
+				 */
+				min_avg_load = avg_load;
 				idlest = group;
 			}

@@ -5470,13 +5495,16 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		goto no_spare;

 	if (this_spare > task_util(p) / 2 &&
-	    imbalance*this_spare > 100*most_spare)
+	    imbalance_scale*this_spare > 100*most_spare)
 		return NULL;
 	else if (most_spare > task_util(p) / 2)
 		return most_spare_sg;

 no_spare:
-	if (!idlest || 100*this_load < imbalance*min_load)
+	if (!idlest ||
+	    (min_runnable_load > (this_runnable_load + imbalance)) ||
+	    ((this_runnable_load < (min_runnable_load + imbalance)) &&
+	     (100*min_avg_load > imbalance_scale*this_avg_load)))
 		return NULL;
 	return idlest;
 }
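
For readers who want to play with the new selection logic outside the
kernel, the per-group decision above boils down to the following
standalone sketch. This is plain userspace C, not kernel code:
pick_new_group() is a made-up helper name, 1024 stands in for
scale_load_down(NICE_0_LOAD), and 125 is assumed as a typical
sd->imbalance_pct.

/*
 * Standalone sketch (not part of the patch) of the three-zone comparison
 * used in find_idlest_group() above. All names and constants here are
 * illustrative only.
 */
#include <stdio.h>
#include <stdbool.h>

#define NICE_0_LOAD_DOWN	1024UL	/* assumed scale_load_down(NICE_0_LOAD) */

static bool pick_new_group(unsigned long min_runnable_load,
			   unsigned long min_avg_load,
			   unsigned long runnable_load,
			   unsigned long avg_load,
			   unsigned long imbalance,
			   int imbalance_scale)
{
	/* Zone 1: the candidate group has significantly less runnable load. */
	if (min_runnable_load > runnable_load + imbalance)
		return true;

	/*
	 * Zone 2: the runnable loads are close, so break the tie with
	 * avg_load, which also accounts for blocked load.
	 */
	if (runnable_load < min_runnable_load + imbalance &&
	    100 * min_avg_load > imbalance_scale * avg_load)
		return true;

	/* Zone 3: the current pick keeps significantly less runnable load. */
	return false;
}

int main(void)
{
	int imbalance_pct = 125;	/* assumed sd->imbalance_pct */
	unsigned long imbalance = NICE_0_LOAD_DOWN * (imbalance_pct - 100) / 100;
	int imbalance_scale = 100 + (imbalance_pct - 100) / 2;

	/*
	 * The runnable loads differ by less than 'imbalance' (256 here),
	 * but the candidate carries less blocked load, so zone 2 picks it.
	 */
	printf("pick new group: %d\n",
	       pick_new_group(2048, 3000, 2000, 2500,
			      imbalance, imbalance_scale));
	return 0;
}

The final no_spare test in the patch applies the same comparison against
the local group's this_runnable_load/this_avg_load to decide whether to
return NULL and stay on the local group.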