Message ID | 5760115C.7040306@arm.com |
---|---|
State | New |
On 14/06/16 17:40, Mike Galbraith wrote:
> On Tue, 2016-06-14 at 15:14 +0100, Dietmar Eggemann wrote:
>
>> IMHO, the hackbench performance "boost" w/o 0905f04eb21f is due to the
>> fact that a new task gets all its load decayed (making it a small task)
>> in the __update_load_avg() call in remove_entity_load_avg() because its
>> se->avg.last_update_time value is 0, which creates a huge time difference
>> compared to cfs_rq->avg.last_update_time. The patch 0905f04eb21f
>> avoids this and thus the task stays big, se->avg.load_avg = 1024.
>
> I don't care much at all about the hackbench "regression" in its own
> right, and what causes it; for me, the bottom line is that there are cases
> where we need to be able to resolve, and can't, simply because we're
> looking at a fuzzy (rippling) reflection.

Understood. I just thought it would be nice to know why 0905f04eb21f
makes this problem even more visible. But so far I wasn't able to figure
out why this diff in se->avg.load_avg [1024 versus 0] has this effect on
cfs_rq->runnable_load_avg, making it even less suitable in find_idlest*.
enqueue_entity_load_avg()'s cfs_rq->runnable_load_* += sa->load_* looks
suspicious though.

> In general, the fuzz helps us to not be so spastic. I'm not sure that
> we really really need to care all that much, because I strongly suspect
> that it's only gonna make any difference at all in corner cases, but
> there are real world cases that matter. I know for a fact that schbench
> (facebook), which is at least based on a real world load, fails early due
> to us stacking tasks due to that fuzzy view of reality. In that case,
> it's because the fuzz consists of a high amplitude aging sawtooth..

... only for fork/exec? Which then would be related to the initial value
of se->avg.load_avg. Otherwise we could go back to pre b92486cbf2aa
"sched: Compute runnable load avg in cpu_load and cpu_avg_load_per_task".

> find_idlest* sees a collection of pseudo-random numbers; effectively,
> the fates pick idlest via lottery, and get it wrong often enough that a big
> box _never_ reaches full utilization before we stack tasks, putting an
> end to the latency game. For generic loads, the smoothing works, but
> for some corners, it blows chunks. Fork/exec seemed like a spot where
> you really can't go wrong by looking at clear, unadulterated reality.
>
> -Mike
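For illustration, here is a minimal standalone sketch of the decay arithmetic under discussion. It is not the kernel's __update_load_avg(); the halving loop, the 2048-period delta and the 1024 starting value are assumptions chosen purely to show why a huge, bogus time delta (se->avg.last_update_time == 0) decays a freshly forked task's load to essentially zero, whereas skipping that update leaves it at its full initial value:

```c
#include <stdio.h>

/*
 * Toy model of the PELT decay: the contribution is halved every 32 periods
 * of ~1ms, i.e. load(n) = load(0) * y^n with y^32 = 0.5. The intermediate
 * y^k factors of the real implementation are ignored here.
 */
static unsigned long decay_load(unsigned long val, unsigned long n_periods)
{
	for (; n_periods >= 32 && val; n_periods -= 32)
		val >>= 1;	/* one half-life per 32 periods */
	return val;
}

int main(void)
{
	unsigned long load_avg = 1024;	/* assumed initial load of the new task */

	/*
	 * se->avg.last_update_time == 0 makes the delta against
	 * cfs_rq->avg.last_update_time huge (seconds worth of periods),
	 * so the new task ends up looking tiny ...
	 */
	printf("decayed over a bogus delta: %lu\n", decay_load(load_avg, 2048));

	/* ... whereas skipping the bogus update keeps it big. */
	printf("update skipped:             %lu\n", load_avg);
	return 0;
}
```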
On 15/06/16 17:03, Mike Galbraith wrote:
> On Wed, 2016-06-15 at 16:32 +0100, Dietmar Eggemann wrote:
>
>>> In general, the fuzz helps us to not be so spastic. I'm not sure that
>>> we really really need to care all that much, because I strongly suspect
>>> that it's only gonna make any difference at all in corner cases, but
>>> there are real world cases that matter. I know for a fact that schbench
>>> (facebook), which is at least based on a real world load, fails early due
>>> to us stacking tasks due to that fuzzy view of reality. In that case,
>>> it's because the fuzz consists of a high amplitude aging sawtooth..
>>
>> ... only for fork/exec?
>
> No. Identical workers had a longish work/sleep cycle; aging resulted in
> weights that ranged from roughly 300-700(ish), depending on when you
> peeked at them.
>
> -Mike

Isn't there a theoretical problem with scale_load() on CONFIG_64BIT
machines on tip/sched/core? load.weight has a higher resolution than
runnable_load_avg (and so do the values in the rq->cpu_load[] array).
Theoretical because [forkexec|wake]_idx is 0, so [target|source]_load()
is nothing else than weighted_cpuload().
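As an aside on the resolution question, a small standalone sketch; the shift value and macro names mirror what a CONFIG_64BIT kernel with the extra fixed-point resolution uses, but the snippet itself is only an illustration, not kernel code:

```c
#include <stdio.h>

#define SCHED_FIXEDPOINT_SHIFT	10
#define scale_load(w)		((w) << SCHED_FIXEDPOINT_SHIFT)
#define scale_load_down(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)

int main(void)
{
	unsigned long nice_0_weight = 1024UL;

	/* se->load.weight is kept in high resolution on 64-bit ... */
	unsigned long load_weight = scale_load(nice_0_weight);		/* 1048576 */

	/* ... while PELT signals and rq->cpu_load[] stay in low resolution. */
	unsigned long runnable_load_avg = nice_0_weight;		/* ~1024 */

	/* Comparing the two directly mixes units; scale_load_down() is needed. */
	printf("raw:        %lu vs. %lu\n", load_weight, runnable_load_avg);
	printf("comparable: %lu vs. %lu\n",
	       scale_load_down(load_weight), runnable_load_avg);
	return 0;
}
```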
On 16/06/16 04:33, Mike Galbraith wrote:
> On Wed, 2016-06-15 at 20:03 +0100, Dietmar Eggemann wrote:
>
>> Isn't there a theoretical problem with scale_load() on CONFIG_64BIT
>> machines on tip/sched/core? load.weight has a higher resolution than
>> runnable_load_avg (and so do the values in the rq->cpu_load[] array).
>> Theoretical because [forkexec|wake]_idx is 0, so [target|source]_load()
>> is nothing else than weighted_cpuload().
>
> I see a not-so-theoretical problem with my rfc in that I forgot to
> scale_load_down(), if that's what you mean.

Yup. Theoretical in the sense that this_load and min_load will both be
affected the same way as long as load_idx = 0.

> (changes nothing, reality was just extra special unadulterated;)

Agreed.

> -Mike
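A rough sketch of the [source|target]_load() logic being referred to (simplified; the real kernel helpers also check the LB_BIAS scheduler feature, and the struct here is invented for the example). With a load_idx of 0 both degenerate to weighted_cpuload(), which is why a missing scale_load_down() biases this_load and min_load identically:

```c
#include <stdio.h>

struct rq_sketch {
	unsigned long cpu_load[5];		/* decayed cpu load history */
	unsigned long runnable_load_avg;	/* cfs_rq->runnable_load_avg */
};

static unsigned long weighted_cpuload(const struct rq_sketch *rq)
{
	return rq->runnable_load_avg;
}

/* Conservative (low) estimate used for the source cpu. */
static unsigned long source_load(const struct rq_sketch *rq, int type)
{
	unsigned long total = weighted_cpuload(rq);

	if (type == 0)
		return total;

	return rq->cpu_load[type - 1] < total ? rq->cpu_load[type - 1] : total;
}

/* Conservative (high) estimate used for the target cpu. */
static unsigned long target_load(const struct rq_sketch *rq, int type)
{
	unsigned long total = weighted_cpuload(rq);

	if (type == 0)
		return total;

	return rq->cpu_load[type - 1] > total ? rq->cpu_load[type - 1] : total;
}

int main(void)
{
	struct rq_sketch rq = {
		.cpu_load = { 700, 650, 600, 550, 500 },
		.runnable_load_avg = 300,
	};

	/* load_idx == 0: both report the same instantaneous value. */
	printf("idx 0: source=%lu target=%lu\n",
	       source_load(&rq, 0), target_load(&rq, 0));

	/* load_idx == 1: the biased history would kick in. */
	printf("idx 1: source=%lu target=%lu\n",
	       source_load(&rq, 1), target_load(&rq, 1));
	return 0;
}
```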
On 04/07/16 16:04, Matt Fleming wrote:
> On Wed, 15 Jun, at 04:32:58PM, Dietmar Eggemann wrote:
>> On 14/06/16 17:40, Mike Galbraith wrote:
>>> On Tue, 2016-06-14 at 15:14 +0100, Dietmar Eggemann wrote:
>>>
>>>> IMHO, the hackbench performance "boost" w/o 0905f04eb21f is due to the
>>>> fact that a new task gets all its load decayed (making it a small task)
>>>> in the __update_load_avg() call in remove_entity_load_avg() because its
>>>> se->avg.last_update_time value is 0, which creates a huge time difference
>>>> compared to cfs_rq->avg.last_update_time. The patch 0905f04eb21f
>>>> avoids this and thus the task stays big, se->avg.load_avg = 1024.
>>>
>>> I don't care much at all about the hackbench "regression" in its own
>>> right, and what causes it; for me, the bottom line is that there are cases
>>> where we need to be able to resolve, and can't, simply because we're
>>> looking at a fuzzy (rippling) reflection.
>>
>> Understood. I just thought it would be nice to know why 0905f04eb21f
>> makes this problem even more visible. But so far I wasn't able to figure
>> out why this diff in se->avg.load_avg [1024 versus 0] has this effect on
>> cfs_rq->runnable_load_avg, making it even less suitable in find_idlest*.
>> enqueue_entity_load_avg()'s cfs_rq->runnable_load_* += sa->load_* looks
>> suspicious though.
>
> In my testing without 0905f04eb21f I saw that se->avg.load_avg
> actually managed to skip being decayed at all before the task was
> dequeued, which meant that cfs_rq->runnable_load_avg was more likely
> to be zero after dequeue, for workloads like hackbench that are
> essentially just a fork bomb.

Do you mean the first dequeue when the task is forked? These are the
PELT-related functions which are called when the task is forked:

  detach_entity_load_avg
  attach_entity_load_avg
  remove_entity_load_avg   <-- se->avg.load_avg is set to 0 w/o 0905f04eb21f,
                               se->avg.load_avg stays 1024 w/ 0905f04eb21f
  enqueue_entity_load_avg
  attach_entity_load_avg   (double attach is fixed on tip/sched/core)
  dequeue_entity_load_avg

> se->avg.load_avg evaded decay because se->avg.period_contrib was being
> zero'd in __update_load_avg().

I don't see the relation to se->avg.period_contrib here. IMHO,
se->avg.period_contrib is purely there to manage the 3 different update
phases in __update_load_avg().

This difference in the initial se->avg.load_avg value [0 or 1024] has an
influence on wake_affine() [weight = p->se.avg.load_avg;] for the wakeup
handling of the hackbench tasks in the 'send/receive data' phase.

There are a couple of patches on tip/sched/core which might change the
behaviour of this: fork path, no double attach_entity_load_avg for new
task, no remove_entity_load_avg for new task, changes in effective_load ...

> With 0905f04eb21f applied, it's less likely (though not impossible)
> that ->period_contrib will be zero'd and so we usually end up with
> some residual load in cfs_rq->runnable_load_avg on dequeue, and hence,
>
>     cfs_rq->runnable_load_avg > se->avg.load_avg
>
> even if 'se' is the only task on the runqueue.
>
> FYI, below is my quick and dirty hack that restored hackbench
> performance for the few machines I checked. I didn't try schbench with
> it.
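To make the "3 different update phases" remark concrete, here is a minimal sketch of the bookkeeping period_contrib covers; the geometric decay that the real __update_load_avg() applies to the old sum and to the d1/d2 segments is deliberately left out, so this is only an illustration, not the kernel function:

```c
#include <stdio.h>

/*
 * d1 - remainder of the 1024us period left open by the previous update
 * d2 - the full periods that elapsed in between
 * d3 - the still-open partial period, which becomes the new period_contrib
 */
static void accumulate_sketch(unsigned long *sum, unsigned long *period_contrib,
			      unsigned long delta_us, unsigned long weight)
{
	unsigned long d0 = *period_contrib;	/* already accounted for */
	unsigned long total = d0 + delta_us;
	unsigned long periods = total / 1024;	/* period boundaries crossed */
	unsigned long d3 = total % 1024;

	if (periods) {
		unsigned long d1 = 1024 - d0;
		unsigned long d2 = (periods - 1) * 1024;

		/* The real code decays *sum, d1 and d2 by y^periods here. */
		*sum += weight * (d1 + d2 + d3);
	} else {
		/* Still inside the same period: simply accrue the delta. */
		*sum += weight * delta_us;
	}

	*period_contrib = d3;
}

int main(void)
{
	unsigned long sum = 0;
	unsigned long period_contrib = 1023;	/* as set for a new entity */

	accumulate_sketch(&sum, &period_contrib, 2000, 1024);
	printf("sum=%lu period_contrib=%lu\n", sum, period_contrib);
	return 0;
}
```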
>
> ---
>
> From 4e9856ea3dc56e356195ca035dab7302754ce59b Mon Sep 17 00:00:00 2001
> From: Matt Fleming <matt@codeblueprint.co.uk>
> Date: Thu, 9 Jun 2016 19:48:14 +0100
> Subject: [PATCH] sched/fair: Reset ::runnable_load_avg when dequeueing last
>  entity
>
> The task and runqueue load averages maintained in p->se.avg.load_avg
> and cfs_rq->runnable_load_avg respectively, can decay at different
> wall clock rates, which means that enqueueing and then dequeueing a
> task on an otherwise empty runqueue doesn't always leave
> ::runnable_load_avg with its initial value.
>
> This can lead to the situation where cfs_rq->runnable_load_avg has a
> non-zero value even though there are no runnable entities on the
> runqueue. Assuming no entity is enqueued on this runqueue for some
> time this residual load average will decay gradually as the load
> averages are updated.
>
> But we can optimise the special case of dequeueing the last entity and
> reset ::runnable_load_avg early, which gives a performance improvement
> to workloads that trigger the load balancer, such as fork-heavy
> applications when SD_BALANCE_FORK is set, because it gives a more up
> to date view of how busy the cpu is.
>
> Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
> ---
>  kernel/sched/fair.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c6dd8bab010c..408ee90c7ea8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3007,10 +3007,20 @@ enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  static inline void
>  dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> +	unsigned long load_avg = 0;
> +
>  	update_load_avg(se, 1);
>
> -	cfs_rq->runnable_load_avg =
> -		max_t(long, cfs_rq->runnable_load_avg - se->avg.load_avg, 0);
> +	/*
> +	 * If we're about to dequeue the last runnable entity we can
> +	 * reset the runnable load average to zero instead of waiting
> +	 * for it to decay naturally. This gives the load balancer a
> +	 * more timely and accurate view of how busy this cpu is.
> +	 */
> +	if (cfs_rq->nr_running > 1)
> +		load_avg = max_t(long, cfs_rq->runnable_load_avg - se->avg.load_avg, 0);
> +
> +	cfs_rq->runnable_load_avg = load_avg;
>  	cfs_rq->runnable_load_sum =
>  		max_t(s64, cfs_rq->runnable_load_sum - se->avg.load_sum, 0);
>  }
>
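A toy illustration (with assumed numbers) of the residual described in the changelog above: because cfs_rq->runnable_load_avg and se->avg.load_avg are aged from different points in time, subtracting the task's load on dequeue can leave a remainder even though the runqueue is now empty:

```c
#include <stdio.h>

int main(void)
{
	long runnable_load_avg = 1024;	/* cfs_rq side, aged a little less */
	long se_load_avg = 1000;	/* task side, aged a little more */

	/* Mirrors the max_t(long, ..., 0) clamp in dequeue_entity_load_avg(). */
	long residual = runnable_load_avg - se_load_avg;
	if (residual < 0)
		residual = 0;

	/* nr_running is now 0, yet the cpu still looks busy to find_idlest*(). */
	printf("residual runnable_load_avg = %ld\n", residual);
	return 0;
}
```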
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -680,7 +680,7 @@ void init_entity_runnable_average(struct sched_entity *se)
 	 * will definitely be update (after enqueue).
 	 */
 	sa->period_contrib = 1023;
-	sa->load_avg = scale_load_down(se->load.weight);
+	sa->load_avg = scale_load_down(0);
 	sa->load_sum = sa->load_avg * LOAD_AVG_MAX;