Message ID: 1461958364-675-4-git-send-email-dietmar.eggemann@arm.com
State: New
On 12 May 2016 at 23:48, Yuyang Du <yuyang.du@intel.com> wrote:
> On Thu, May 12, 2016 at 03:31:51AM -0700, tip-bot for Morten Rasmussen wrote:
>> Commit-ID:  cfa10334318d8212d007da8c771187643c9cef35
>> Gitweb:     http://git.kernel.org/tip/cfa10334318d8212d007da8c771187643c9cef35
>> Author:     Morten Rasmussen <morten.rasmussen@arm.com>
>> AuthorDate: Fri, 29 Apr 2016 20:32:40 +0100
>> Committer:  Ingo Molnar <mingo@kernel.org>
>> CommitDate: Thu, 12 May 2016 09:55:33 +0200
>>
>> sched/fair: Correct unit of load_above_capacity
>>
>> In calculate_imbalance() load_above_capacity currently has the unit
>> [capacity] while it is used as being [load/capacity]. Not only is it
>> wrong, it also makes it unlikely that load_above_capacity is ever used,
>> as the subsequent code picks the smaller of load_above_capacity and
>> the avg_load.
>>
>> This patch ensures that load_above_capacity has the right unit
>> [load/capacity].

load_above_capacity has the unit [load] and is then compared with other load.

>>
>> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
>> [ Changed changelog to note it was in capacity unit; +rebase. ]
>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>> Cc: Mike Galbraith <efault@gmx.de>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: linux-kernel@vger.kernel.org
>> Link: http://lkml.kernel.org/r/1461958364-675-4-git-send-email-dietmar.eggemann@arm.com
>> Signed-off-by: Ingo Molnar <mingo@kernel.org>
>> ---
>>  kernel/sched/fair.c | 6 ++++--
>>  1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 2338105..218f8e8 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7067,9 +7067,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>>  	if (busiest->group_type == group_overloaded &&
>>  	    local->group_type == group_overloaded) {
>>  		load_above_capacity = busiest->sum_nr_running * SCHED_CAPACITY_SCALE;
>> -		if (load_above_capacity > busiest->group_capacity)
>> +		if (load_above_capacity > busiest->group_capacity) {
>>  			load_above_capacity -= busiest->group_capacity;
>> -		else
>> +			load_above_capacity *= NICE_0_LOAD;
>> +			load_above_capacity /= busiest->group_capacity;

If I follow the sequence correctly,

  load_above_capacity = (busiest->sum_nr_running * SCHED_CAPACITY_SCALE
                         - busiest->group_capacity) * NICE_0_LOAD / busiest->group_capacity

so

  load_above_capacity = busiest->sum_nr_running * SCHED_CAPACITY_SCALE /
                        busiest->group_capacity * NICE_0_LOAD - NICE_0_LOAD

so the unit is [load].

Let's take the example of a group made of 2 cores with 2 threads per
core, where the capacity of each core is 1.5 * SCHED_CAPACITY_SCALE, so the
capacity of the group is 3 * SCHED_CAPACITY_SCALE.

If we have 6 tasks on this group:

  load_above_capacity = 1 * NICE_0_LOAD

which means we want to remove no more than 1 task from the group and
leave 5 tasks in the group, whereas we don't expect to have more than 4
tasks as we have 4 CPUs, and probably fewer because the compute capacity
of each CPU is less than the default one.

So I would have expected

  load_above_capacity = busiest->sum_nr_running * NICE_0_LOAD -
                        busiest->group_capacity * NICE_0_LOAD / SCHED_CAPACITY_SCALE
  load_above_capacity = 3 * NICE_0_LOAD

which means we want to remove no more than 3 tasks and leave 3 tasks in
the group.

Regards,
Vincent

>> +		} else
>>  			load_above_capacity = ~0UL;
>>  	}
>
> Hi Morten,
>
> I got the feeling this might be wrong, the NICE_0_LOAD should be scaled down.
> But I hope I am wrong.
>
> Vincent, could you take a look?
On Fri, May 13, 2016 at 10:22:19AM +0200, Vincent Guittot wrote:
> On 12 May 2016 at 23:48, Yuyang Du <yuyang.du@intel.com> wrote:
> > On Thu, May 12, 2016 at 03:31:51AM -0700, tip-bot for Morten Rasmussen wrote:
> >> Commit-ID:  cfa10334318d8212d007da8c771187643c9cef35
> >> Gitweb:     http://git.kernel.org/tip/cfa10334318d8212d007da8c771187643c9cef35
> >> Author:     Morten Rasmussen <morten.rasmussen@arm.com>
> >> AuthorDate: Fri, 29 Apr 2016 20:32:40 +0100
> >> Committer:  Ingo Molnar <mingo@kernel.org>
> >> CommitDate: Thu, 12 May 2016 09:55:33 +0200
> >>
> >> sched/fair: Correct unit of load_above_capacity
> >>
> >> In calculate_imbalance() load_above_capacity currently has the unit
> >> [capacity] while it is used as being [load/capacity]. Not only is it
> >> wrong, it also makes it unlikely that load_above_capacity is ever used,
> >> as the subsequent code picks the smaller of load_above_capacity and
> >> the avg_load.
> >>
> >> This patch ensures that load_above_capacity has the right unit
> >> [load/capacity].
>
> load_above_capacity has the unit [load] and is then compared with other load.

I'm not sure we are talking about the same code; I'm referring to:

  max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);

a bit further down. Here the first option has unit [load/capacity], and
the subsequent code multiplies the result with [capacity] to get back to
[load]:

  /* How much load to actually move to equalise the imbalance */
  env->imbalance = min(
          max_pull * busiest->group_capacity,
          (sds->avg_load - local->avg_load) * local->group_capacity
  ) / SCHED_CAPACITY_SCALE;

That led me to the conclusion that load_above_capacity would have to be
'per capacity', i.e. [load/capacity], as well for the code to make
sense.

> >>
> >> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> >> [ Changed changelog to note it was in capacity unit; +rebase. ]
> >> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> >> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> >> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> >> Cc: Mike Galbraith <efault@gmx.de>
> >> Cc: Peter Zijlstra <peterz@infradead.org>
> >> Cc: Thomas Gleixner <tglx@linutronix.de>
> >> Cc: linux-kernel@vger.kernel.org
> >> Link: http://lkml.kernel.org/r/1461958364-675-4-git-send-email-dietmar.eggemann@arm.com
> >> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> >> ---
> >>  kernel/sched/fair.c | 6 ++++--
> >>  1 file changed, 4 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 2338105..218f8e8 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -7067,9 +7067,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> >>  	if (busiest->group_type == group_overloaded &&
> >>  	    local->group_type == group_overloaded) {
> >>  		load_above_capacity = busiest->sum_nr_running * SCHED_CAPACITY_SCALE;
> >> -		if (load_above_capacity > busiest->group_capacity)
> >> +		if (load_above_capacity > busiest->group_capacity) {
> >>  			load_above_capacity -= busiest->group_capacity;
> >> -		else
> >> +			load_above_capacity *= NICE_0_LOAD;
> >> +			load_above_capacity /= busiest->group_capacity;
>
> If I follow the sequence correctly,
>
>   load_above_capacity = (busiest->sum_nr_running * SCHED_CAPACITY_SCALE
>                          - busiest->group_capacity) * NICE_0_LOAD / busiest->group_capacity
>
> so
>
>   load_above_capacity = busiest->sum_nr_running * SCHED_CAPACITY_SCALE /
>                         busiest->group_capacity * NICE_0_LOAD - NICE_0_LOAD
>
> so the unit is [load]

I'm with you on the equation, but not on the unit, and I find it hard to
convince myself what the unit really is :-(

I think it comes down to the fact that we are comparing [load] directly
with [capacity] here, basically saying [load] == [capacity]. Most other
places we scale [load] by [capacity]; we don't compare the two or
otherwise imply that they are the same unit. Do we have any other
examples of this?

The name load_above_capacity suggests that it should have unit [load],
but we actually compute it by subtracting two values, where the latter
clearly has unit [capacity]. PeterZ's recent patch (1be0eb2a97
"sched/fair: Clean up scale confusion") changes the former to be
[capacity] as well.

load_above_capacity is later multiplied by [capacity] to determine the
imbalance, which must have unit [load]. So working our way backwards,
load_above_capacity must have unit [load/capacity]. However, if [load] ==
[capacity], then what we really have is just group-normalized [load].

As said in my other reply, the code only really makes sense under
special circumstances where [load] == [capacity]. The easy, and possibly
best, way out of this mess would be to replace the code with something
like PeterZ's suggestion that I tried to implement in the patch included
in my other reply.

> Let's take the example of a group made of 2 cores with 2 threads per
> core, where the capacity of each core is 1.5 * SCHED_CAPACITY_SCALE, so the
> capacity of the group is 3 * SCHED_CAPACITY_SCALE.
>
> If we have 6 tasks on this group:
>
>   load_above_capacity = 1 * NICE_0_LOAD
>
> which means we want to remove no more than 1 task from the group and
> leave 5 tasks in the group, whereas we don't expect to have more than 4
> tasks as we have 4 CPUs, and probably fewer because the compute capacity
> of each CPU is less than the default one.
>
> So I would have expected
>
>   load_above_capacity = busiest->sum_nr_running * NICE_0_LOAD -
>                         busiest->group_capacity * NICE_0_LOAD / SCHED_CAPACITY_SCALE
>   load_above_capacity = 3 * NICE_0_LOAD
>
> which means we want to remove no more than 3 tasks and leave 3 tasks in
> the group.

And this is exactly what you get with this patch :-) load_above_capacity
(through max_pull) is multiplied by the group capacity to compute the
actual amount of [load] to remove:

  env->imbalance = load_above_capacity * busiest->group_capacity /
                   SCHED_CAPACITY_SCALE

                 = 1*NICE_0_LOAD * 3*SCHED_CAPACITY_SCALE /
                   SCHED_CAPACITY_SCALE

                 = 3*NICE_0_LOAD

I don't think we disagree on how it should work :-) Without the capacity
scaling in this patch you get:

  env->imbalance = (6*SCHED_CAPACITY_SCALE - 3*SCHED_CAPACITY_SCALE) *
                   3*SCHED_CAPACITY_SCALE / SCHED_CAPACITY_SCALE

                 = 9*SCHED_CAPACITY_SCALE

Coming back to Yuyang's question: I think it should be NICE_0_LOAD to
ensure that the resulting imbalance has the proper unit [load].

I'm happy to scrap this patch and replace the whole thing with something
else that makes more sense, but IMHO it is needed if the current mess is
to make any sense at all.
Hi Morten,

On 19 May 2016 at 17:36, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Fri, May 13, 2016 at 10:22:19AM +0200, Vincent Guittot wrote:
>> On 12 May 2016 at 23:48, Yuyang Du <yuyang.du@intel.com> wrote:
>> > On Thu, May 12, 2016 at 03:31:51AM -0700, tip-bot for Morten Rasmussen wrote:
>> >> Commit-ID:  cfa10334318d8212d007da8c771187643c9cef35
>> >> Gitweb:     http://git.kernel.org/tip/cfa10334318d8212d007da8c771187643c9cef35
>> >> Author:     Morten Rasmussen <morten.rasmussen@arm.com>
>> >> AuthorDate: Fri, 29 Apr 2016 20:32:40 +0100
>> >> Committer:  Ingo Molnar <mingo@kernel.org>
>> >> CommitDate: Thu, 12 May 2016 09:55:33 +0200
>> >>
>> >> sched/fair: Correct unit of load_above_capacity
>> >>
>> >> In calculate_imbalance() load_above_capacity currently has the unit
>> >> [capacity] while it is used as being [load/capacity]. Not only is it
>> >> wrong, it also makes it unlikely that load_above_capacity is ever used,
>> >> as the subsequent code picks the smaller of load_above_capacity and
>> >> the avg_load.
>> >>
>> >> This patch ensures that load_above_capacity has the right unit
>> >> [load/capacity].
>>
>> load_above_capacity has the unit [load] and is then compared with other load.
>
> I'm not sure we are talking about the same code; I'm referring to:
>
>   max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);
>
> a bit further down. Here the first option has unit [load/capacity], and
> the subsequent code multiplies the result with [capacity] to get back to
> [load]:

My understanding is that it multiplies and divides by capacity, so we
select the minimum between

  max_pull * busiest->group_capacity / SCHED_CAPACITY_SCALE

and

  (sds->avg_load - local->avg_load) * local->group_capacity / SCHED_CAPACITY_SCALE

so the unit of imbalance is the same as max_pull's, because we multiply
and divide by [capacity]. The imbalance's unit is [load], so max_pull
also has unit [load], and max_pull has the same unit as
load_above_capacity.

>
>         /* How much load to actually move to equalise the imbalance */
>         env->imbalance = min(
>                 max_pull * busiest->group_capacity,
>                 (sds->avg_load - local->avg_load) * local->group_capacity
>         ) / SCHED_CAPACITY_SCALE;
>
> That led me to the conclusion that load_above_capacity would have to be
> 'per capacity', i.e. [load/capacity], as well for the code to make
> sense.
>
>> >>
>> >> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
>> >> [ Changed changelog to note it was in capacity unit; +rebase. ]
>> >> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> >> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
>> >> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>> >> Cc: Mike Galbraith <efault@gmx.de>
>> >> Cc: Peter Zijlstra <peterz@infradead.org>
>> >> Cc: Thomas Gleixner <tglx@linutronix.de>
>> >> Cc: linux-kernel@vger.kernel.org
>> >> Link: http://lkml.kernel.org/r/1461958364-675-4-git-send-email-dietmar.eggemann@arm.com
>> >> Signed-off-by: Ingo Molnar <mingo@kernel.org>
>> >> ---
>> >>  kernel/sched/fair.c | 6 ++++--
>> >>  1 file changed, 4 insertions(+), 2 deletions(-)
>> >>
>> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> >> index 2338105..218f8e8 100644
>> >> --- a/kernel/sched/fair.c
>> >> +++ b/kernel/sched/fair.c
>> >> @@ -7067,9 +7067,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>> >>  	if (busiest->group_type == group_overloaded &&
>> >>  	    local->group_type == group_overloaded) {
>> >>  		load_above_capacity = busiest->sum_nr_running * SCHED_CAPACITY_SCALE;
>> >> -		if (load_above_capacity > busiest->group_capacity)
>> >> +		if (load_above_capacity > busiest->group_capacity) {
>> >>  			load_above_capacity -= busiest->group_capacity;
>> >> -		else
>> >> +			load_above_capacity *= NICE_0_LOAD;
>> >> +			load_above_capacity /= busiest->group_capacity;
>>
>> If I follow the sequence correctly,
>>
>>   load_above_capacity = (busiest->sum_nr_running * SCHED_CAPACITY_SCALE
>>                          - busiest->group_capacity) * NICE_0_LOAD / busiest->group_capacity
>>
>> so
>>
>>   load_above_capacity = busiest->sum_nr_running * SCHED_CAPACITY_SCALE /
>>                         busiest->group_capacity * NICE_0_LOAD - NICE_0_LOAD
>>
>> so the unit is [load]
>
> I'm with you on the equation, but not on the unit, and I find it hard to
> convince myself what the unit really is :-(

The sole NICE_0_LOAD in the equation above tends to confirm that it's
only [load].

>
> I think it comes down to the fact that we are comparing [load] directly
> with [capacity] here, basically saying [load] == [capacity]. Most other
> places we scale [load] by [capacity]; we don't compare the two or
> otherwise imply that they are the same unit. Do we have any other
> examples of this?

IIUC the goal of this part of calculate_imbalance: we make the
assumption that one task with NICE_0_LOAD should fit in one core with
SCHED_CAPACITY_SCALE capacity, and we want to ensure that we will not
make a CPU idle.

>
> The name load_above_capacity suggests that it should have unit [load],
> but we actually compute it by subtracting two values, where the latter
> clearly has unit [capacity]. PeterZ's recent patch (1be0eb2a97
> "sched/fair: Clean up scale confusion") changes the former to be
> [capacity] as well.

I have made a comment on this. The original equation was

  load_above_capacity -= busiest->group_capacity

which comes from the optimization of

  load_above_capacity -= busiest->group_capacity * SCHED_LOAD_SCALE / SCHED_CAPACITY_SCALE

The assumption behind the optimization was that SCHED_LOAD_SCALE ==
SCHED_CAPACITY_SCALE and SCHED_LOAD_SCALE == NICE_0_LOAD, but that's not
always true if we enable the extra precision for 64-bit systems.

>
> load_above_capacity is later multiplied by [capacity] to determine the
> imbalance, which must have unit [load]. So working our way backwards,

It's multiplied by xxx->group_capacity and divided by SCHED_CAPACITY_SCALE.

> load_above_capacity must have unit [load/capacity]. However, if [load] ==
> [capacity], then what we really have is just group-normalized [load].
>
> As said in my other reply, the code only really makes sense under
> special circumstances where [load] == [capacity]. The easy, and possibly
> best, way out of this mess would be to replace the code with something
> like PeterZ's suggestion that I tried to implement in the patch included
> in my other reply.

I'm fine with replacing this part with something simpler.

>
>> Let's take the example of a group made of 2 cores with 2 threads per
>> core, where the capacity of each core is 1.5 * SCHED_CAPACITY_SCALE, so the
>> capacity of the group is 3 * SCHED_CAPACITY_SCALE.
>>
>> If we have 6 tasks on this group:
>>
>>   load_above_capacity = 1 * NICE_0_LOAD
>>
>> which means we want to remove no more than 1 task from the group and
>> leave 5 tasks in the group, whereas we don't expect to have more than 4
>> tasks as we have 4 CPUs, and probably fewer because the compute capacity
>> of each CPU is less than the default one.
>>
>> So I would have expected
>>
>>   load_above_capacity = busiest->sum_nr_running * NICE_0_LOAD -
>>                         busiest->group_capacity * NICE_0_LOAD / SCHED_CAPACITY_SCALE
>>   load_above_capacity = 3 * NICE_0_LOAD
>>
>> which means we want to remove no more than 3 tasks and leave 3 tasks in
>> the group.
>
> And this is exactly what you get with this patch :-) load_above_capacity
> (through max_pull) is multiplied by the group capacity to compute the
> actual amount of [load] to remove:

I forgot the additional weighting of the load by the group's
capacity / SCHED_CAPACITY_SCALE.

>
>   env->imbalance = load_above_capacity * busiest->group_capacity /
>                    SCHED_CAPACITY_SCALE
>
>                  = 1*NICE_0_LOAD * 3*SCHED_CAPACITY_SCALE /
>                    SCHED_CAPACITY_SCALE
>
>                  = 3*NICE_0_LOAD
>
> I don't think we disagree on how it should work :-) Without the capacity
> scaling in this patch you get:
>
>   env->imbalance = (6*SCHED_CAPACITY_SCALE - 3*SCHED_CAPACITY_SCALE) *
>                    3*SCHED_CAPACITY_SCALE / SCHED_CAPACITY_SCALE
>
>                  = 9*SCHED_CAPACITY_SCALE
>
> Coming back to Yuyang's question: I think it should be NICE_0_LOAD to
> ensure that the resulting imbalance has the proper unit [load].

Yes, I agree it should be NICE_0_LOAD.

> I'm happy to scrap this patch and replace the whole thing with something
> else that makes more sense, but IMHO it is needed if the current mess is
> to make any sense at all.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bc19fee94f39..c066574cff04 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7016,9 +7016,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	    local->group_type == group_overloaded) {
 		load_above_capacity = busiest->sum_nr_running * SCHED_LOAD_SCALE;
-		if (load_above_capacity > busiest->group_capacity)
+		if (load_above_capacity > busiest->group_capacity) {
 			load_above_capacity -= busiest->group_capacity;
-		else
+			load_above_capacity *= SCHED_LOAD_SCALE;
+			load_above_capacity /= busiest->group_capacity;
+		} else
 			load_above_capacity = ~0UL;
 	}