Message ID | 1480088073-11642-3-git-send-email-vincent.guittot@linaro.org |
---|---|
State | Superseded |
On Fri, Nov 25, 2016 at 04:34:33PM +0100, Vincent Guittot wrote: > find_idlest_group() only compares the runnable_load_avg when looking for > the least loaded group. But on fork intensive use case like hackbench > where tasks blocked quickly after the fork, this can lead to selecting the > same CPU instead of other CPUs, which have similar runnable load but a > lower load_avg. > > When the runnable_load_avg of 2 CPUs are close, we now take into account > the amount of blocked load as a 2nd selection factor. There is now 3 zones > for the runnable_load of the rq: > -[0 .. (runnable_load - imbalance)] : Select the new rq which has > significantly less runnable_load > -](runnable_load - imbalance) .. (runnable_load + imbalance)[ : The > runnable load are close so we use load_avg to chose between the 2 rq > -[(runnable_load + imbalance) .. ULONG_MAX] : Keep the current rq which > has significantly less runnable_load > > For use case like hackbench, this enable the scheduler to select different > CPUs during the fork sequence and to spread tasks across the system. > > Tests have been done on a Hikey board (ARM based octo cores) for several > kernel. The result below gives min, max, avg and stdev values of 18 runs > with each configuration. > > The v4.8+patches configuration also includes the changes below which is > part of the proposal made by Peter to ensure that the clock will be up to > date when the fork task will be attached to the rq. > > @@ -2568,6 +2568,7 @@ void wake_up_new_task(struct task_struct *p) > __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0)); > #endif > rq = __task_rq_lock(p, &rf); > + update_rq_clock(rq); > post_init_entity_util_avg(&p->se); > > activate_task(rq, p, 0); > > hackbench -P -g 1 > > ea86cb4b7621 7dc603c9028e v4.8 v4.8+patches > min 0.049 0.050 0.051 0,048 > avg 0.057 0.057(0%) 0.057(0%) 0,055(+5%) > max 0.066 0.068 0.070 0,063 > stdev +/-9% +/-9% +/-8% +/-9% > > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> > --- > > Changes since v2: > - Rebase on latest sched/core > - Get same results with the rebase and the fix mentioned in patch 01 > > kernel/sched/fair.c | 48 ++++++++++++++++++++++++++++++++++++++---------- > 1 file changed, 38 insertions(+), 10 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 820a787..ecb5ee8 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -5395,16 +5395,20 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, > { > struct sched_group *idlest = NULL, *group = sd->groups; > struct sched_group *most_spare_sg = NULL; > - unsigned long min_load = ULONG_MAX, this_load = 0; > + unsigned long min_runnable_load = ULONG_MAX, this_runnable_load = 0; > + unsigned long min_avg_load = ULONG_MAX, this_avg_load = 0; > unsigned long most_spare = 0, this_spare = 0; > int load_idx = sd->forkexec_idx; > - int imbalance = 100 + (sd->imbalance_pct-100)/2; > + int imbalance_scale = 100 + (sd->imbalance_pct-100)/2; > + unsigned long imbalance = scale_load_down(NICE_0_LOAD) * > + (sd->imbalance_pct-100) / 100; > > if (sd_flag & SD_BALANCE_WAKE) > load_idx = sd->wake_idx; > > do { > - unsigned long load, avg_load, spare_cap, max_spare_cap; > + unsigned long load, avg_load, runnable_load; > + unsigned long spare_cap, max_spare_cap; > int local_group; > int i; > > @@ -5421,6 +5425,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, > * the group containing the CPU with most spare capacity. 
> */ > avg_load = 0; > + runnable_load = 0; > max_spare_cap = 0; > > for_each_cpu(i, sched_group_cpus(group)) { > @@ -5430,7 +5435,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, > else > load = target_load(i, load_idx); > > - avg_load += load; > + runnable_load += load; > + > + avg_load += cfs_rq_load_avg(&cpu_rq(i)->cfs); > > spare_cap = capacity_spare_wake(i, p); > > @@ -5439,14 +5446,32 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, > } > > /* Adjust by relative CPU capacity of the group */ > - avg_load = (avg_load * SCHED_CAPACITY_SCALE) / group->sgc->capacity; > + avg_load = (avg_load * SCHED_CAPACITY_SCALE) / > + group->sgc->capacity; > + runnable_load = (runnable_load * SCHED_CAPACITY_SCALE) / > + group->sgc->capacity; > > if (local_group) { > - this_load = avg_load; > + this_runnable_load = runnable_load; > + this_avg_load = avg_load; > this_spare = max_spare_cap; > } else { > - if (avg_load < min_load) { > - min_load = avg_load; > + if (min_runnable_load > (runnable_load + imbalance)) { > + /* > + * The runnable load is significantly smaller > + * so we can pick this new cpu > + */ > + min_runnable_load = runnable_load; > + min_avg_load = avg_load; > + idlest = group; > + } else if ((runnable_load < (min_runnable_load + imbalance)) && > + (100*min_avg_load > imbalance_scale*avg_load)) { > + /* > + * The runnable loads are close so we take > + * into account blocked load through avg_load > + * which is blocked + runnable load > + */ > + min_avg_load = avg_load; > idlest = group; > } > > @@ -5470,13 +5495,16 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, > goto no_spare; > > if (this_spare > task_util(p) / 2 && > - imbalance*this_spare > 100*most_spare) > + imbalance_scale*this_spare > 100*most_spare) > return NULL; > else if (most_spare > task_util(p) / 2) > return most_spare_sg; > > no_spare: > - if (!idlest || 100*this_load < imbalance*min_load) > + if (!idlest || > + (min_runnable_load > (this_runnable_load + imbalance)) || > + ((this_runnable_load < (min_runnable_load + imbalance)) && > + (100*min_avg_load > imbalance_scale*this_avg_load))) I don't get why you have imbalance_scale applied to this_avg_load and not min_avg_load. IIUC, you end up preferring non-local groups? If we take the example where this_runnable_load == min_runnable_load and this_avg_load == min_avg_load. In this case, and in cases where min_avg_load is slightly bigger than this_avg_load, we end up picking the 'idlest' group even if the local group is equally good or even slightly better? > return NULL; > return idlest; > } Overall, I like that load_avg is being brought in to make better decisions. The variable naming is a bit confusing. For example, runnable_load is a capacity-average just like avg_load. 'imbalance' is now an absolute capacity-average margin, but it is hard to come up with better short alternatives. Although 'imbalance' is based on the existing imbalance_pct, I find somewhat arbitrary. Why is (imbalance_pct-100)*1024/100 a good absolute margin to define the interval where we want to consider load_avg? I guess it is case of 'we had to pick some value', which we have done in many other places. Though, IMHO, it is a bit strange that imbalance_pct is used in two different ways to bias comparison in the same function. It used to be only used as a scaling factor (now imbalance_scale), while this patch proposes to use it for computing an absolute margin (imbalance) as well. 
It is not a major issue, but it is not clear why it is used differently to compare two metrics that are relatively closely related.

Morten
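A worked example may help illustrate the concern about the asymmetric condition (the numbers are purely illustrative and assume the common imbalance_pct of 125, so imbalance_scale = 100 + (125-100)/2 = 112; none of this code is from the patch):

#include <stdbool.h>
#include <stdio.h>

int main(void)
{
        /* assumed values: both groups carry identical blocked+runnable load */
        unsigned long this_avg_load = 1024, min_avg_load = 1024;
        unsigned long imbalance_scale = 112;    /* 100 + (125 - 100)/2 */

        /* condition as posted: true means return NULL, i.e. keep the local group */
        bool stay_local = 100 * min_avg_load > imbalance_scale * this_avg_load;
        printf("posted:    stay local = %d\n", stay_local);    /* 102400 > 114688 -> 0 */

        /* condition suggested in the follow-up replies: keep the local group
         * unless min_avg_load is significantly smaller than this_avg_load */
        stay_local = 100 * this_avg_load < imbalance_scale * min_avg_load;
        printf("suggested: stay local = %d\n", stay_local);    /* 102400 < 114688 -> 1 */

        return 0;
}

With the posted form, the new task is sent to 'idlest' even though the local group is equally good, which is the preference inversion described above.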
On 30 November 2016 at 13:49, Morten Rasmussen <morten.rasmussen@arm.com> wrote: > On Fri, Nov 25, 2016 at 04:34:33PM +0100, Vincent Guittot wrote: >> find_idlest_group() only compares the runnable_load_avg when looking for >> the least loaded group. But on fork intensive use case like hackbench [snip] >> + min_avg_load = avg_load; >> + idlest = group; >> + } else if ((runnable_load < (min_runnable_load + imbalance)) && >> + (100*min_avg_load > imbalance_scale*avg_load)) { >> + /* >> + * The runnable loads are close so we take >> + * into account blocked load through avg_load >> + * which is blocked + runnable load >> + */ >> + min_avg_load = avg_load; >> idlest = group; >> } >> >> @@ -5470,13 +5495,16 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, >> goto no_spare; >> >> if (this_spare > task_util(p) / 2 && >> - imbalance*this_spare > 100*most_spare) >> + imbalance_scale*this_spare > 100*most_spare) >> return NULL; >> else if (most_spare > task_util(p) / 2) >> return most_spare_sg; >> >> no_spare: >> - if (!idlest || 100*this_load < imbalance*min_load) >> + if (!idlest || >> + (min_runnable_load > (this_runnable_load + imbalance)) || >> + ((this_runnable_load < (min_runnable_load + imbalance)) && >> + (100*min_avg_load > imbalance_scale*this_avg_load))) > > I don't get why you have imbalance_scale applied to this_avg_load and > not min_avg_load. IIUC, you end up preferring non-local groups? In fact, I have keep the same condition that is used when looping the group. You're right that we should prefer local rq if avg_load are close and test the condition (100*this_avg_load > imbalance_scale*min_avg_load) instead > > If we take the example where this_runnable_load == min_runnable_load and > this_avg_load == min_avg_load. In this case, and in cases where > min_avg_load is slightly bigger than this_avg_load, we end up picking > the 'idlest' group even if the local group is equally good or even > slightly better? > >> return NULL; >> return idlest; >> } > > Overall, I like that load_avg is being brought in to make better > decisions. The variable naming is a bit confusing. For example, > runnable_load is a capacity-average just like avg_load. 'imbalance' is > now an absolute capacity-average margin, but it is hard to come up with > better short alternatives. > > Although 'imbalance' is based on the existing imbalance_pct, I find > somewhat arbitrary. Why is (imbalance_pct-100)*1024/100 a good absolute > margin to define the interval where we want to consider load_avg? I > guess it is case of 'we had to pick some value', which we have done in > many other places. Though, IMHO, it is a bit strange that imbalance_pct > is used in two different ways to bias comparison in the same function. I see imbalance_pct like the definition of the acceptable imbalance % for a sched_domain. This % is then used against the current load or to define an absolute value. > It used to be only used as a scaling factor (now imbalance_scale), while > this patch proposes to use it for computing an absolute margin > (imbalance) as well. It is not major issue, but it is not clear why it > is used differently to compare two metrics that are relatively closely > related. In fact, scaling factor (imbalance) doesn't work well with small value. As an example, the use of a scaling factor fails as soon as this_runnable_load = 0 because we always selected local rq even if min_runnable_load is only 1 which doesn't really make sense because they are just the same. > > Morten
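To put illustrative numbers on the small-value point (again assuming imbalance_pct = 125): the old no_spare check was 100*this_load < imbalance*min_load, where imbalance was the ratio 100 + (125-100)/2 = 112. With this_load = 0, that test is true for any min_load >= 1 (0 < 112), so the local group is always kept even though loads of 0 and 1 are effectively the same. With the absolute margin, imbalance = scale_load_down(NICE_0_LOAD) * (125 - 100) / 100 = 1024 * 25 / 100 = 256, so runnable loads of 0 and 1 both fall inside the "close" band and the avg_load comparison breaks the tie instead.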
On 30 November 2016 at 14:49, Vincent Guittot <vincent.guittot@linaro.org> wrote: > On 30 November 2016 at 13:49, Morten Rasmussen <morten.rasmussen@arm.com> wrote: >> On Fri, Nov 25, 2016 at 04:34:33PM +0100, Vincent Guittot wrote: >>> find_idlest_group() only compares the runnable_load_avg when looking for >>> the least loaded group. But on fork intensive use case like hackbench > > [snip] > >>> + min_avg_load = avg_load; >>> + idlest = group; >>> + } else if ((runnable_load < (min_runnable_load + imbalance)) && >>> + (100*min_avg_load > imbalance_scale*avg_load)) { >>> + /* >>> + * The runnable loads are close so we take >>> + * into account blocked load through avg_load >>> + * which is blocked + runnable load >>> + */ >>> + min_avg_load = avg_load; >>> idlest = group; >>> } >>> >>> @@ -5470,13 +5495,16 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, >>> goto no_spare; >>> >>> if (this_spare > task_util(p) / 2 && >>> - imbalance*this_spare > 100*most_spare) >>> + imbalance_scale*this_spare > 100*most_spare) >>> return NULL; >>> else if (most_spare > task_util(p) / 2) >>> return most_spare_sg; >>> >>> no_spare: >>> - if (!idlest || 100*this_load < imbalance*min_load) >>> + if (!idlest || >>> + (min_runnable_load > (this_runnable_load + imbalance)) || >>> + ((this_runnable_load < (min_runnable_load + imbalance)) && >>> + (100*min_avg_load > imbalance_scale*this_avg_load))) >> >> I don't get why you have imbalance_scale applied to this_avg_load and >> not min_avg_load. IIUC, you end up preferring non-local groups? > > In fact, I have keep the same condition that is used when looping the group. > You're right that we should prefer local rq if avg_load are close and > test the condition > (100*this_avg_load > imbalance_scale*min_avg_load) instead Of course the correct condition is (100*this_avg_load < imbalance_scale*min_avg_load) > >> >> If we take the example where this_runnable_load == min_runnable_load and >> this_avg_load == min_avg_load. In this case, and in cases where >> min_avg_load is slightly bigger than this_avg_load, we end up picking >> the 'idlest' group even if the local group is equally good or even >> slightly better? >> >>> return NULL; >>> return idlest; >>> } >> >> Overall, I like that load_avg is being brought in to make better >> decisions. The variable naming is a bit confusing. For example, >> runnable_load is a capacity-average just like avg_load. 'imbalance' is >> now an absolute capacity-average margin, but it is hard to come up with >> better short alternatives. >> >> Although 'imbalance' is based on the existing imbalance_pct, I find >> somewhat arbitrary. Why is (imbalance_pct-100)*1024/100 a good absolute >> margin to define the interval where we want to consider load_avg? I >> guess it is case of 'we had to pick some value', which we have done in >> many other places. Though, IMHO, it is a bit strange that imbalance_pct >> is used in two different ways to bias comparison in the same function. > > I see imbalance_pct like the definition of the acceptable imbalance % > for a sched_domain. This % is then used against the current load or to > define an absolute value. > >> It used to be only used as a scaling factor (now imbalance_scale), while >> this patch proposes to use it for computing an absolute margin >> (imbalance) as well. It is not major issue, but it is not clear why it >> is used differently to compare two metrics that are relatively closely >> related. 
> > In fact, scaling factor (imbalance) doesn't work well with small > value. As an example, the use of a scaling factor fails as soon as > this_runnable_load = 0 because we always selected local rq even if > min_runnable_load is only 1 which doesn't really make sense because > they are just the same. > >> >> Morten
On Wed, Nov 30, 2016 at 02:49:11PM +0100, Vincent Guittot wrote: > On 30 November 2016 at 13:49, Morten Rasmussen <morten.rasmussen@arm.com> wrote: > > On Fri, Nov 25, 2016 at 04:34:33PM +0100, Vincent Guittot wrote: > >> find_idlest_group() only compares the runnable_load_avg when looking for > >> the least loaded group. But on fork intensive use case like hackbench > > [snip] > > >> + min_avg_load = avg_load; > >> + idlest = group; > >> + } else if ((runnable_load < (min_runnable_load + imbalance)) && > >> + (100*min_avg_load > imbalance_scale*avg_load)) { > >> + /* > >> + * The runnable loads are close so we take > >> + * into account blocked load through avg_load > >> + * which is blocked + runnable load > >> + */ > >> + min_avg_load = avg_load; > >> idlest = group; > >> } > >> > >> @@ -5470,13 +5495,16 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, > >> goto no_spare; > >> > >> if (this_spare > task_util(p) / 2 && > >> - imbalance*this_spare > 100*most_spare) > >> + imbalance_scale*this_spare > 100*most_spare) > >> return NULL; > >> else if (most_spare > task_util(p) / 2) > >> return most_spare_sg; > >> > >> no_spare: > >> - if (!idlest || 100*this_load < imbalance*min_load) > >> + if (!idlest || > >> + (min_runnable_load > (this_runnable_load + imbalance)) || > >> + ((this_runnable_load < (min_runnable_load + imbalance)) && > >> + (100*min_avg_load > imbalance_scale*this_avg_load))) > > > > I don't get why you have imbalance_scale applied to this_avg_load and > > not min_avg_load. IIUC, you end up preferring non-local groups? > > In fact, I have keep the same condition that is used when looping the group. The logic is inverted compared to the group loop as there you after picking a better group. There it makes sense that you accept groups with a slightly higher runnable_load if they have a much better avg_load. Here you are after rejecting 'idlest' group if it isn't significantly better than the local group, so the conditions have to be inverted. > You're right that we should prefer local rq if avg_load are close and > test the condition > (100*this_avg_load > imbalance_scale*min_avg_load) instead I would argue you should switch the inequality operator around: (100*this_avg_load < imbalance_scale*min_avg_load) So it becomes true unless min_avg_load is significantly less than this_avg_load meaning that we will ignore 'idlest' and return NULL. This is also in line with old condition (100*this_load < imbalance*min_load). > > > > If we take the example where this_runnable_load == min_runnable_load and > > this_avg_load == min_avg_load. In this case, and in cases where > > min_avg_load is slightly bigger than this_avg_load, we end up picking > > the 'idlest' group even if the local group is equally good or even > > slightly better? > > > >> return NULL; > >> return idlest; > >> } > > > > Overall, I like that load_avg is being brought in to make better > > decisions. The variable naming is a bit confusing. For example, > > runnable_load is a capacity-average just like avg_load. 'imbalance' is > > now an absolute capacity-average margin, but it is hard to come up with > > better short alternatives. > > > > Although 'imbalance' is based on the existing imbalance_pct, I find > > somewhat arbitrary. Why is (imbalance_pct-100)*1024/100 a good absolute > > margin to define the interval where we want to consider load_avg? I > > guess it is case of 'we had to pick some value', which we have done in > > many other places. 
Though, IMHO, it is a bit strange that imbalance_pct > > is used in two different ways to bias comparison in the same function. > > I see imbalance_pct like the definition of the acceptable imbalance % > for a sched_domain. This % is then used against the current load or to > define an absolute value. > > > It used to be only used as a scaling factor (now imbalance_scale), while > > this patch proposes to use it for computing an absolute margin > > (imbalance) as well. It is not major issue, but it is not clear why it > > is used differently to compare two metrics that are relatively closely > > related. > > In fact, scaling factor (imbalance) doesn't work well with small > value. As an example, the use of a scaling factor fails as soon as > this_runnable_load = 0 because we always selected local rq even if > min_runnable_load is only 1 which doesn't really make sense because > they are just the same. Agreed, using the operator for scaling is not ideal for low load scenarios. The spare-capacity checking is fixing that issue for some scenarios but not all. Could we mention these points somewhere so people can be reminded later? In the commit message, if not as a comment in the code?
On Wed, Nov 30, 2016 at 02:54:00PM +0100, Vincent Guittot wrote: > On 30 November 2016 at 14:49, Vincent Guittot > <vincent.guittot@linaro.org> wrote: > > On 30 November 2016 at 13:49, Morten Rasmussen <morten.rasmussen@arm.com> wrote: > >> On Fri, Nov 25, 2016 at 04:34:33PM +0100, Vincent Guittot wrote: > >>> find_idlest_group() only compares the runnable_load_avg when looking for > >>> the least loaded group. But on fork intensive use case like hackbench > > > > [snip] > > > >>> + min_avg_load = avg_load; > >>> + idlest = group; > >>> + } else if ((runnable_load < (min_runnable_load + imbalance)) && > >>> + (100*min_avg_load > imbalance_scale*avg_load)) { > >>> + /* > >>> + * The runnable loads are close so we take > >>> + * into account blocked load through avg_load > >>> + * which is blocked + runnable load > >>> + */ > >>> + min_avg_load = avg_load; > >>> idlest = group; > >>> } > >>> > >>> @@ -5470,13 +5495,16 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, > >>> goto no_spare; > >>> > >>> if (this_spare > task_util(p) / 2 && > >>> - imbalance*this_spare > 100*most_spare) > >>> + imbalance_scale*this_spare > 100*most_spare) > >>> return NULL; > >>> else if (most_spare > task_util(p) / 2) > >>> return most_spare_sg; > >>> > >>> no_spare: > >>> - if (!idlest || 100*this_load < imbalance*min_load) > >>> + if (!idlest || > >>> + (min_runnable_load > (this_runnable_load + imbalance)) || > >>> + ((this_runnable_load < (min_runnable_load + imbalance)) && > >>> + (100*min_avg_load > imbalance_scale*this_avg_load))) > >> > >> I don't get why you have imbalance_scale applied to this_avg_load and > >> not min_avg_load. IIUC, you end up preferring non-local groups? > > > > In fact, I have keep the same condition that is used when looping the group. > > You're right that we should prefer local rq if avg_load are close and > > test the condition > > (100*this_avg_load > imbalance_scale*min_avg_load) instead > > Of course the correct condition is > (100*this_avg_load < imbalance_scale*min_avg_load) Agreed, I should have read the entire thread before replying :-)
On 30 November 2016 at 15:24, Morten Rasmussen <morten.rasmussen@arm.com> wrote: > On Wed, Nov 30, 2016 at 02:54:00PM +0100, Vincent Guittot wrote: >> On 30 November 2016 at 14:49, Vincent Guittot >> <vincent.guittot@linaro.org> wrote: >> > On 30 November 2016 at 13:49, Morten Rasmussen <morten.rasmussen@arm.com> wrote: >> >> On Fri, Nov 25, 2016 at 04:34:33PM +0100, Vincent Guittot wrote: >> >>> find_idlest_group() only compares the runnable_load_avg when looking for >> >>> the least loaded group. But on fork intensive use case like hackbench >> > >> > [snip] >> > >> >>> + min_avg_load = avg_load; >> >>> + idlest = group; >> >>> + } else if ((runnable_load < (min_runnable_load + imbalance)) && >> >>> + (100*min_avg_load > imbalance_scale*avg_load)) { >> >>> + /* >> >>> + * The runnable loads are close so we take >> >>> + * into account blocked load through avg_load >> >>> + * which is blocked + runnable load >> >>> + */ >> >>> + min_avg_load = avg_load; >> >>> idlest = group; >> >>> } >> >>> >> >>> @@ -5470,13 +5495,16 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, >> >>> goto no_spare; >> >>> >> >>> if (this_spare > task_util(p) / 2 && >> >>> - imbalance*this_spare > 100*most_spare) >> >>> + imbalance_scale*this_spare > 100*most_spare) >> >>> return NULL; >> >>> else if (most_spare > task_util(p) / 2) >> >>> return most_spare_sg; >> >>> >> >>> no_spare: >> >>> - if (!idlest || 100*this_load < imbalance*min_load) >> >>> + if (!idlest || >> >>> + (min_runnable_load > (this_runnable_load + imbalance)) || >> >>> + ((this_runnable_load < (min_runnable_load + imbalance)) && >> >>> + (100*min_avg_load > imbalance_scale*this_avg_load))) >> >> >> >> I don't get why you have imbalance_scale applied to this_avg_load and >> >> not min_avg_load. IIUC, you end up preferring non-local groups? >> > >> > In fact, I have keep the same condition that is used when looping the group. >> > You're right that we should prefer local rq if avg_load are close and >> > test the condition >> > (100*this_avg_load > imbalance_scale*min_avg_load) instead >> >> Of course the correct condition is >> (100*this_avg_load < imbalance_scale*min_avg_load) > > Agreed, I should have read the entire thread before replying :-) Interestingly, the original condition (100*min_avg_load > imbalance_scale*this_avg_load) gives better performance result for the hackbench test than the new one : (100*this_avg_load < imbalance_scale*min_avg_load) Matt, Have you been able to get some results for the patchset ? Vincent
On Fri, 02 Dec, at 04:20:54PM, Vincent Guittot wrote: > > Matt, > > Have you been able to get some results for the patchset ? Sorry, I messed up the test config and the tests never ran :( I've redone everything and fast-tracked the results on the SUSE grid now. Should have results by morning.
On Fri, Nov 25, 2016 at 7:34 AM, Vincent Guittot <vincent.guittot@linaro.org> wrote: > > find_idlest_group() only compares the runnable_load_avg when looking for > the least loaded group. But on fork intensive use case like hackbench > where tasks blocked quickly after the fork, this can lead to selecting the > same CPU instead of other CPUs, which have similar runnable load but a > lower load_avg. > > When the runnable_load_avg of 2 CPUs are close, we now take into account > the amount of blocked load as a 2nd selection factor. There is now 3 zones > for the runnable_load of the rq: > -[0 .. (runnable_load - imbalance)] : Select the new rq which has > significantly less runnable_load > -](runnable_load - imbalance) .. (runnable_load + imbalance)[ : The > runnable load are close so we use load_avg to chose between the 2 rq > -[(runnable_load + imbalance) .. ULONG_MAX] : Keep the current rq which > has significantly less runnable_load > For background, is this from the "A decade of wasted cores" paper's patches? What's the expected typical gain? Thanks, Brendan
On Fri, 02 Dec, at 07:31:04PM, Brendan Gregg wrote:
>
> For background, is this from the "A decade of wasted cores" paper's
> patches?

No, this patch fixes an issue I originally reported here,

  https://lkml.kernel.org/r/20160923115808.2330-1-matt@codeblueprint.co.uk

Essentially, if you have an idle or partially-idle system and a workload
that consists of fork()'ing a bunch of tasks, where each of those tasks
immediately sleeps waiting for some wakeup, then those tasks aren't
spread across all idle CPUs very well.

We saw this issue when running hackbench with a small loop count, such
that the actual benchmark setup (fork()'ing) is where the majority of
the runtime is spent. In that scenario, there's a large potential/blocked
load, but essentially no runnable load, and the balance-on-fork scheduler
code only cares about runnable load without Vincent's patch applied.

The closest thing I can find in the "A decade of wasted cores" paper is
"The Overload-on-Wakeup bug", but I don't think that's the issue here
since,

 a) We're balancing on fork, not wakeup
 b) The balance-on-fork code balances across nodes OK

> What's the expected typical gain? Thanks,

The results are still coming back from the SUSE performance test grid
but they do show that this patch is mainly a win for multi-socket
machines with more than 8 cores or thereabouts.

[ Vincent, I'll follow up to your PATCH 1/2 with the results that are
  specifically for that patch ]

Assuming a fork-intensive or fork-dominated workload, and a multi-socket
machine, such as this 2 socket, NUMA, with 12 cores and HT enabled
(48 cpus), we saw a very clear win between +10% and +15% for processes
communicating via pipes,

 (1) tip-sched        = tip/sched/core branch
 (2) fix-fig-for-fork = (1) + PATCH 1/2
 (3) fix-sig          = (1) + (2) + PATCH 2/2

hackbench-process-pipes
                         4.9.0-rc6             4.9.0-rc6             4.9.0-rc6
                         tip-sched      fix-fig-for-fork               fix-sig
Amean    1       0.0717 (  0.00%)      0.0696 (  2.99%)      0.0730 ( -1.79%)
Amean    4       0.1244 (  0.00%)      0.1200 (  3.56%)      0.1190 (  4.36%)
Amean    7       0.1891 (  0.00%)      0.1937 ( -2.42%)      0.1831 (  3.17%)
Amean    12      0.2964 (  0.00%)      0.3116 ( -5.11%)      0.2784 (  6.07%)
Amean    21      0.4011 (  0.00%)      0.4090 ( -1.96%)      0.3574 ( 10.90%)
Amean    30      0.4944 (  0.00%)      0.4654 (  5.87%)      0.4171 ( 15.63%)
Amean    48      0.6113 (  0.00%)      0.6309 ( -3.20%)      0.5331 ( 12.78%)
Amean    79      0.8616 (  0.00%)      0.8706 ( -1.04%)      0.7710 ( 10.51%)
Amean    110     1.1304 (  0.00%)      1.2211 ( -8.02%)      1.0163 ( 10.10%)
Amean    141     1.3754 (  0.00%)      1.4279 ( -3.81%)      1.2803 (  6.92%)
Amean    172     1.6217 (  0.00%)      1.7367 ( -7.09%)      1.5363 (  5.27%)
Amean    192     1.7809 (  0.00%)      2.0199 (-13.42%)      1.7129 (  3.82%)

Things look even better when using threads and pipes, with wins between
11% and 29% when looking at results outside of the noise,

hackbench-thread-pipes
                         4.9.0-rc6             4.9.0-rc6             4.9.0-rc6
                         tip-sched      fix-fig-for-fork               fix-sig
Amean    1       0.0736 (  0.00%)      0.0794 ( -7.96%)      0.0779 ( -5.83%)
Amean    4       0.1709 (  0.00%)      0.1690 (  1.09%)      0.1663 (  2.68%)
Amean    7       0.2836 (  0.00%)      0.3080 ( -8.61%)      0.2640 (  6.90%)
Amean    12      0.4393 (  0.00%)      0.4843 (-10.24%)      0.4090 (  6.89%)
Amean    21      0.5821 (  0.00%)      0.6369 ( -9.40%)      0.5126 ( 11.95%)
Amean    30      0.6557 (  0.00%)      0.6459 (  1.50%)      0.5711 ( 12.90%)
Amean    48      0.7924 (  0.00%)      0.7760 (  2.07%)      0.6286 ( 20.68%)
Amean    79      1.0534 (  0.00%)      1.0551 ( -0.16%)      0.8481 ( 19.49%)
Amean    110     1.5286 (  0.00%)      1.4504 (  5.11%)      1.1121 ( 27.24%)
Amean    141     1.9507 (  0.00%)      1.7790 (  8.80%)      1.3804 ( 29.23%)
Amean    172     2.2261 (  0.00%)      2.3330 ( -4.80%)      1.6336 ( 26.62%)
Amean    192     2.3753 (  0.00%)      2.3307 (  1.88%)      1.8246 ( 23.19%)

Somewhat surprisingly, I can see improvements for UMA machines with fewer
cores when the workload heavily saturates the machine and the workload
isn't dominated by fork. Such heavy saturation isn't super realistic, but
still interesting. I haven't dug into why these results occurred, but I
am happy things didn't instead fall off a cliff.

Here's a 4-cpu UMA box showing some improvement at the higher end,

hackbench-process-pipes
                         4.9.0-rc6             4.9.0-rc6             4.9.0-rc6
                         tip-sched      fix-fig-for-fork               fix-sig
Amean    1       3.5060 (  0.00%)      3.5747 ( -1.96%)      3.5117 ( -0.16%)
Amean    3       7.7113 (  0.00%)      7.8160 ( -1.36%)      7.7747 ( -0.82%)
Amean    5      11.4453 (  0.00%)     11.5710 ( -1.10%)     11.3870 (  0.51%)
Amean    7      15.3147 (  0.00%)     15.9420 ( -4.10%)     15.8450 ( -3.46%)
Amean    12     25.5110 (  0.00%)     24.3410 (  4.59%)     22.6717 ( 11.13%)
Amean    16     32.3010 (  0.00%)     28.5897 ( 11.49%)     25.7473 ( 20.29%)
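The fork()-then-sleep pattern described above can be sketched in a few lines of C (a hypothetical toy, not hackbench itself; NTASKS and the pipe are illustrative): each child blocks immediately after fork(), so it contributes blocked load but almost no runnable load at the moment the next fork() is placed, which is exactly the case where find_idlest_group() used to pile tasks onto the same CPU.

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NTASKS 40

int main(void)
{
        int pipefd[2];
        char buf;

        if (pipe(pipefd) < 0)
                return 1;

        for (int i = 0; i < NTASKS; i++) {
                pid_t pid = fork();

                if (pid == 0) {
                        /* child: block immediately, waiting for a wakeup */
                        if (read(pipefd[0], &buf, 1) < 0)
                                _exit(1);
                        _exit(0);
                }
                if (pid < 0)
                        return 1;
        }

        /* parent: wake all the sleepers, then reap them */
        for (int i = 0; i < NTASKS; i++)
                if (write(pipefd[1], "x", 1) < 0)
                        return 1;
        for (int i = 0; i < NTASKS; i++)
                wait(NULL);

        return 0;
}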
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 820a787..ecb5ee8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5395,16 +5395,20 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 {
        struct sched_group *idlest = NULL, *group = sd->groups;
        struct sched_group *most_spare_sg = NULL;
-       unsigned long min_load = ULONG_MAX, this_load = 0;
+       unsigned long min_runnable_load = ULONG_MAX, this_runnable_load = 0;
+       unsigned long min_avg_load = ULONG_MAX, this_avg_load = 0;
        unsigned long most_spare = 0, this_spare = 0;
        int load_idx = sd->forkexec_idx;
-       int imbalance = 100 + (sd->imbalance_pct-100)/2;
+       int imbalance_scale = 100 + (sd->imbalance_pct-100)/2;
+       unsigned long imbalance = scale_load_down(NICE_0_LOAD) *
+                               (sd->imbalance_pct-100) / 100;

        if (sd_flag & SD_BALANCE_WAKE)
                load_idx = sd->wake_idx;

        do {
-               unsigned long load, avg_load, spare_cap, max_spare_cap;
+               unsigned long load, avg_load, runnable_load;
+               unsigned long spare_cap, max_spare_cap;
                int local_group;
                int i;

@@ -5421,6 +5425,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
                 * the group containing the CPU with most spare capacity.
                 */
                avg_load = 0;
+               runnable_load = 0;
                max_spare_cap = 0;

                for_each_cpu(i, sched_group_cpus(group)) {
@@ -5430,7 +5435,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
                        else
                                load = target_load(i, load_idx);

-                       avg_load += load;
+                       runnable_load += load;
+
+                       avg_load += cfs_rq_load_avg(&cpu_rq(i)->cfs);

                        spare_cap = capacity_spare_wake(i, p);

@@ -5439,14 +5446,32 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
                }

                /* Adjust by relative CPU capacity of the group */
-               avg_load = (avg_load * SCHED_CAPACITY_SCALE) / group->sgc->capacity;
+               avg_load = (avg_load * SCHED_CAPACITY_SCALE) /
+                               group->sgc->capacity;
+               runnable_load = (runnable_load * SCHED_CAPACITY_SCALE) /
+                               group->sgc->capacity;

                if (local_group) {
-                       this_load = avg_load;
+                       this_runnable_load = runnable_load;
+                       this_avg_load = avg_load;
                        this_spare = max_spare_cap;
                } else {
-                       if (avg_load < min_load) {
-                               min_load = avg_load;
+                       if (min_runnable_load > (runnable_load + imbalance)) {
+                               /*
+                                * The runnable load is significantly smaller
+                                * so we can pick this new cpu
+                                */
+                               min_runnable_load = runnable_load;
+                               min_avg_load = avg_load;
+                               idlest = group;
+                       } else if ((runnable_load < (min_runnable_load + imbalance)) &&
+                                       (100*min_avg_load > imbalance_scale*avg_load)) {
+                               /*
+                                * The runnable loads are close so we take
+                                * into account blocked load through avg_load
+                                * which is blocked + runnable load
+                                */
+                               min_avg_load = avg_load;
                                idlest = group;
                        }

@@ -5470,13 +5495,16 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
                goto no_spare;

        if (this_spare > task_util(p) / 2 &&
-           imbalance*this_spare > 100*most_spare)
+           imbalance_scale*this_spare > 100*most_spare)
                return NULL;
        else if (most_spare > task_util(p) / 2)
                return most_spare_sg;

 no_spare:
-       if (!idlest || 100*this_load < imbalance*min_load)
+       if (!idlest ||
+           (min_runnable_load > (this_runnable_load + imbalance)) ||
+           ((this_runnable_load < (min_runnable_load + imbalance)) &&
+            (100*min_avg_load > imbalance_scale*this_avg_load)))
                return NULL;
        return idlest;
 }
find_idlest_group() only compares the runnable_load_avg when looking for
the least loaded group. But on fork-intensive use cases like hackbench,
where tasks block quickly after the fork, this can lead to selecting the
same CPU instead of other CPUs, which have a similar runnable load but a
lower load_avg.

When the runnable_load_avg values of 2 CPUs are close, we now take into
account the amount of blocked load as a 2nd selection factor. There are
now 3 zones for the runnable_load of the rq:
-[0 .. (runnable_load - imbalance)] : Select the new rq which has
significantly less runnable_load
-](runnable_load - imbalance) .. (runnable_load + imbalance)[ : The
runnable loads are close, so we use load_avg to choose between the 2 rqs
-[(runnable_load + imbalance) .. ULONG_MAX] : Keep the current rq which
has significantly less runnable_load

For use cases like hackbench, this enables the scheduler to select
different CPUs during the fork sequence and to spread tasks across the
system.

Tests have been done on a Hikey board (ARM based octo cores) for several
kernels. The results below give min, max, avg and stdev values of 18 runs
with each configuration.

The v4.8+patches configuration also includes the change below, which is
part of the proposal made by Peter to ensure that the clock will be up to
date when the forked task is attached to the rq.

@@ -2568,6 +2568,7 @@ void wake_up_new_task(struct task_struct *p)
        __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
 #endif
        rq = __task_rq_lock(p, &rf);
+       update_rq_clock(rq);
        post_init_entity_util_avg(&p->se);

        activate_task(rq, p, 0);

hackbench -P -g 1

        ea86cb4b7621   7dc603c9028e   v4.8         v4.8+patches
min     0.049          0.050          0.051        0.048
avg     0.057          0.057(0%)      0.057(0%)    0.055(+5%)
max     0.066          0.068          0.070        0.063
stdev   +/-9%          +/-9%          +/-8%        +/-9%

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---

Changes since v2:
- Rebase on latest sched/core
- Get same results with the rebase and the fix mentioned in patch 01

 kernel/sched/fair.c | 48 ++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 38 insertions(+), 10 deletions(-)

-- 
2.7.4
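The three zones described above reduce to a single candidate-selection test. The following is a minimal standalone sketch with illustrative names (not the kernel function itself), assuming 'imbalance' is the absolute margin scale_load_down(NICE_0_LOAD) * (imbalance_pct - 100) / 100 and 'imbalance_scale' is the ratio 100 + (imbalance_pct - 100)/2 from the patch:

#include <stdbool.h>

/*
 * Return true if the candidate group should replace the currently
 * selected one, following the three zones of the patch:
 *   zone 1: candidate runnable load is lower by more than 'imbalance'
 *   zone 2: runnable loads are close, so blocked+runnable (avg) load decides
 *   zone 3: candidate runnable load is higher by more than 'imbalance'
 */
bool pick_candidate(unsigned long cand_runnable, unsigned long cand_avg,
                    unsigned long min_runnable, unsigned long min_avg,
                    unsigned long imbalance, unsigned long imbalance_scale)
{
        /* zone 1: significantly less runnable load -> take the candidate */
        if (min_runnable > cand_runnable + imbalance)
                return true;

        /* zone 2: close runnable loads -> let avg_load (incl. blocked) decide */
        if (cand_runnable < min_runnable + imbalance &&
            100 * min_avg > imbalance_scale * cand_avg)
                return true;

        /* zone 3: significantly more runnable load -> keep the current pick */
        return false;
}

This mirrors the group-selection loop in the posted patch; as discussed in the thread, the final no_spare comparison against the local group should use the inverted form (100*this_avg_load < imbalance_scale*min_avg_load) so that an equally good local group is preferred.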