Message ID: 20171205171018.9203-3-patrick.bellasi@arm.com
State:      New
Series:     Utilization estimation (util_est) for FAIR tasks
On Tue, Dec 05, 2017 at 05:10:16PM +0000, Patrick Bellasi wrote:
> +	if (cfs_rq->nr_running > 0) {
> +		util_est  = cfs_rq->util_est_runnable;
> +		util_est -= task_util_est(p);
> +		if (util_est < 0)
> +			util_est = 0;
> +		cfs_rq->util_est_runnable = util_est;
> +	} else {

I'm thinking that's an explicit load-store to avoid intermediate values
landing in cfs_rq->util_est_runnable, right?

That would need READ_ONCE() / WRITE_ONCE() I think, without that the
compiler is free to munge the lot together.
On Tue, Dec 05, 2017 at 05:10:16PM +0000, Patrick Bellasi wrote:
> @@ -562,6 +577,12 @@ struct task_struct {
>
>  	const struct sched_class *sched_class;
>  	struct sched_entity se;
> +	/*
> +	 * Since we use se.avg.util_avg to update util_est fields,
> +	 * this last can benefit from being close to se which
> +	 * also defines se.avg as cache aligned.
> +	 */
> +	struct util_est util_est;
>  	struct sched_rt_entity rt;
>  #ifdef CONFIG_CGROUP_SCHED
>  	struct task_group *sched_task_group;
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b19552a212de..8371839075fa 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -444,6 +444,7 @@ struct cfs_rq {
>  	 * CFS load tracking
>  	 */
>  	struct sched_avg avg;
> +	unsigned long util_est_runnable;
>  #ifndef CONFIG_64BIT
>  	u64 load_last_update_time_copy;
>  #endif

So you put the util_est in task_struct (not sched_entity) but the
util_est_runnable in cfs_rq (not rq). Seems inconsistent.
On 13-Dec 17:19, Peter Zijlstra wrote:
> On Tue, Dec 05, 2017 at 05:10:16PM +0000, Patrick Bellasi wrote:
> > @@ -562,6 +577,12 @@ struct task_struct {
> >
> >  	const struct sched_class *sched_class;
> >  	struct sched_entity se;
> > +	/*
> > +	 * Since we use se.avg.util_avg to update util_est fields,
> > +	 * this last can benefit from being close to se which
> > +	 * also defines se.avg as cache aligned.
> > +	 */
> > +	struct util_est util_est;
> >  	struct sched_rt_entity rt;
> >  #ifdef CONFIG_CGROUP_SCHED
> >  	struct task_group *sched_task_group;
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index b19552a212de..8371839075fa 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -444,6 +444,7 @@ struct cfs_rq {
> >  	 * CFS load tracking
> >  	 */
> >  	struct sched_avg avg;
> > +	unsigned long util_est_runnable;
> >  #ifndef CONFIG_64BIT
> >  	u64 load_last_update_time_copy;
> >  #endif
> >
> So you put the util_est in task_struct (not sched_entity) but the
> util_est_runnable in cfs_rq (not rq). Seems inconsistent.

One goal was to keep the util_est variables close to the util_avg used to
load the filter, for cache affinity's sake.

The other goal was to have util_est data only for Tasks and CPU's
RQ, thus avoiding unused data for TG's RQ and SE.

Unfortunately the first goal does not allow us to completely achieve the
second and, you're right, the solution looks a bit inconsistent.

Do you think we should better disregard cache proximity and move
util_est_runnable to rq?

--
#include <best/regards.h>

Patrick Bellasi
On Wed, Dec 13, 2017 at 04:36:53PM +0000, Patrick Bellasi wrote:
> On 13-Dec 17:19, Peter Zijlstra wrote:
> > On Tue, Dec 05, 2017 at 05:10:16PM +0000, Patrick Bellasi wrote:
> > > @@ -562,6 +577,12 @@ struct task_struct {
> > >
> > >  	const struct sched_class *sched_class;
> > >  	struct sched_entity se;
> > > +	/*
> > > +	 * Since we use se.avg.util_avg to update util_est fields,
> > > +	 * this last can benefit from being close to se which
> > > +	 * also defines se.avg as cache aligned.
> > > +	 */
> > > +	struct util_est util_est;

The thing is, since sched_entity has a member with cacheline alignment,
the whole structure must have cacheline alignment, and this util_est
_will_ start on a new line.

See also:

  $ pahole -EC task_struct defconfig/kernel/sched/core.o

  ...
		struct sched_avg {
			/* typedef u64 */ long long unsigned int last_update_time; /*  576   8 */
			/* typedef u64 */ long long unsigned int load_sum;          /*  584   8 */
			/* typedef u32 */ unsigned int util_sum;                    /*  592   4 */
			/* typedef u32 */ unsigned int period_contrib;              /*  596   4 */
			long unsigned int load_avg;                                 /*  600   8 */
			long unsigned int util_avg;                                 /*  608   8 */
		} avg; /*  576  40 */
		/* --- cacheline 6 boundary (384 bytes) --- */
	} se; /*  192  448 */
	/* --- cacheline 8 boundary (512 bytes) was 24 bytes ago --- */
	struct util_est {
		long unsigned int last;                                             /*  640   8 */
		long unsigned int ewma;                                             /*  648   8 */
	} util_est; /*  640  16 */
  ...

The thing is somewhat confused on which cacheline is which, but you'll
see sched_avg landing at 576 (cacheline #9) and util_est at 640 (line
#10).

> > >  	struct sched_rt_entity rt;
> > >  #ifdef CONFIG_CGROUP_SCHED
> > >  	struct task_group *sched_task_group;

> One goal was to keep the util_est variables close to the util_avg used to
> load the filter, for cache affinity's sake.
>
> The other goal was to have util_est data only for Tasks and CPU's
> RQ, thus avoiding unused data for TG's RQ and SE.
>
> Unfortunately the first goal does not allow us to completely achieve the
> second and, you're right, the solution looks a bit inconsistent.
>
> Do you think we should better disregard cache proximity and move
> util_est_runnable to rq?

proximity is likely important; I'd suggest moving util_est into
sched_entity.
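For reference, a minimal userspace sketch of the effect described above (illustrative only: the struct names are stand-ins and a GCC-style aligned attribute is used in place of the kernel's ____cacheline_aligned_in_smp annotation). An aligned member gives the containing struct cacheline alignment, so the member that follows it in the outer struct starts on a fresh 64-byte line:

	#include <stdio.h>
	#include <stddef.h>

	/* stand-in for sched_avg; in the kernel it is the *member* inside
	 * sched_entity that carries the cacheline alignment */
	struct fake_avg {
		unsigned long util_avg;
	};

	struct fake_se {
		unsigned long weight;
		struct fake_avg avg __attribute__((aligned(64)));
	};

	/* stand-in for task_struct with util_est placed right after se */
	struct fake_task {
		struct fake_se se;
		struct { unsigned long last, ewma; } util_est;
	};

	int main(void)
	{
		/* on x86-64 this prints 64, 128, 128: se is padded to a
		 * multiple of 64B, so util_est begins on a new cache line */
		printf("alignof(se)=%zu sizeof(se)=%zu offsetof(util_est)=%zu\n",
		       _Alignof(struct fake_se), sizeof(struct fake_se),
		       offsetof(struct fake_task, util_est));
		return 0;
	}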
On 13-Dec 18:03, Peter Zijlstra wrote:
> On Wed, Dec 13, 2017 at 04:36:53PM +0000, Patrick Bellasi wrote:
> > On 13-Dec 17:19, Peter Zijlstra wrote:
> > > On Tue, Dec 05, 2017 at 05:10:16PM +0000, Patrick Bellasi wrote:
> > > > @@ -562,6 +577,12 @@ struct task_struct {
> > > >
> > > >  	const struct sched_class *sched_class;
> > > >  	struct sched_entity se;
> > > > +	/*
> > > > +	 * Since we use se.avg.util_avg to update util_est fields,
> > > > +	 * this last can benefit from being close to se which
> > > > +	 * also defines se.avg as cache aligned.
> > > > +	 */
> > > > +	struct util_est util_est;
>
> The thing is, since sched_entity has a member with cacheline alignment,
> the whole structure must have cacheline alignment, and this util_est
> _will_ start on a new line.

Right, I was not considering that "aligned" also affects the start of the
data that follows.

> See also:
>
>   $ pahole -EC task_struct defconfig/kernel/sched/core.o
>
>   ...
> 		struct sched_avg {
> 			/* typedef u64 */ long long unsigned int last_update_time; /*  576   8 */
> 			/* typedef u64 */ long long unsigned int load_sum;          /*  584   8 */
> 			/* typedef u32 */ unsigned int util_sum;                    /*  592   4 */
> 			/* typedef u32 */ unsigned int period_contrib;              /*  596   4 */
> 			long unsigned int load_avg;                                 /*  600   8 */
> 			long unsigned int util_avg;                                 /*  608   8 */
> 		} avg; /*  576  40 */
> 		/* --- cacheline 6 boundary (384 bytes) --- */
> 	} se; /*  192  448 */
> 	/* --- cacheline 8 boundary (512 bytes) was 24 bytes ago --- */
> 	struct util_est {
> 		long unsigned int last;                                             /*  640   8 */
> 		long unsigned int ewma;                                             /*  648   8 */
> 	} util_est; /*  640  16 */
>   ...
>
> The thing is somewhat confused on which cacheline is which, but you'll
> see sched_avg landing at 576 (cacheline #9) and util_est at 640 (line
> #10).
>
> > > >  	struct sched_rt_entity rt;
> > > >  #ifdef CONFIG_CGROUP_SCHED
> > > >  	struct task_group *sched_task_group;
>
> > One goal was to keep the util_est variables close to the util_avg used to
> > load the filter, for cache affinity's sake.
> >
> > The other goal was to have util_est data only for Tasks and CPU's
> > RQ, thus avoiding unused data for TG's RQ and SE.
> >
> > Unfortunately the first goal does not allow us to completely achieve the
> > second and, you're right, the solution looks a bit inconsistent.
> >
> > Do you think we should better disregard cache proximity and move
> > util_est_runnable to rq?
>
> proximity is likely important; I'd suggest moving util_est into
> sched_entity.

So, by moving util_est right after sched_avg, here is what we get (with some
lines to better highlight 64B boundaries):

   const struct sched_class * sched_class;                                  /*  152   8 */
   struct sched_entity {
	[...]
---[ Line 9 ]-------------------------------------------------------------------------------
	struct sched_avg {
		/* typedef u64 */ long long unsigned int last_update_time;  /*  576   8 */
		/* typedef u64 */ long long unsigned int load_sum;           /*  584   8 */
		/* typedef u64 */ long long unsigned int runnable_load_sum;  /*  592   8 */
		/* typedef u32 */ unsigned int util_sum;                     /*  600   4 */
		/* typedef u32 */ unsigned int period_contrib;               /*  604   4 */
		long unsigned int load_avg;                                  /*  608   8 */
		long unsigned int runnable_load_avg;                         /*  616   8 */
		long unsigned int util_avg;                                  /*  624   8 */
	} avg; /*  576  56 */
	/* --- cacheline 6 boundary (384 bytes) was 24 bytes ago --- */
	struct util_est {
		long unsigned int last;                                      /*  632   8 */
---[ Line 10 ]------------------------------------------------------------------------------
		long unsigned int ewma;                                      /*  640   8 */
	} util_est; /*  632  16 */
   } se; /*  192  512 */
---[ Line 11 ]------------------------------------------------------------------------------
   /* --- cacheline 9 boundary (576 bytes) was 24 bytes ago --- */
   struct sched_rt_entity {
	struct list_head {
		struct list_head * next;                                     /*  704   8 */
		struct list_head * prev;                                     /*  712   8 */
	} run_list; /*  704  16 */

As you can see, we still end up with util_est spanning across two cache lines
and even worse with an almost empty Line 10. The point is that sched_avg
already uses 56B... which leaves just 8 bytes left.

So, I can move util_est there and use unsigned int for "last" and "ewma"
storage. This should fix the cache alignment but only as long as we do not
add other stuff to sched_avg.

BTW, shouldn't it be possible to use a similar "fasting" approach for
load_avg and runnable_load_avg? Given their range a u32 should be just good
enough, isn't it?

Cheers Patrick

--
#include <best/regards.h>

Patrick Bellasi
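A sketch of the shrunken variant discussed above (an assumption about one possible layout, not a posted revision): with "last" and "ewma" as u32 and util_est placed right after avg inside sched_entity, the pair ends exactly on the 64-byte boundary instead of spilling onto the next line. Written as a userspace program with stand-in struct names:

	#include <stdio.h>
	#include <stddef.h>
	#include <stdint.h>

	/* the 56-byte sched_avg layout from the pahole dump above */
	struct fake_sched_avg {
		uint64_t last_update_time, load_sum, runnable_load_sum;
		uint32_t util_sum, period_contrib;
		unsigned long load_avg, runnable_load_avg, util_avg;
	};

	/* util_est shrunk to two u32s; utilization values are bounded by
	 * SCHED_CAPACITY_SCALE (1024), so 32 bits are plenty */
	struct fake_util_est {
		uint32_t last;
		uint32_t ewma;
	};

	struct fake_sched_entity {
		/* the kernel aligns this member to a cache line */
		struct fake_sched_avg avg __attribute__((aligned(64)));
		struct fake_util_est util_est;	/* bytes 56..63 of the line */
	};

	int main(void)
	{
		/* expected on LP64: 56, 56, 64 -> util_est now completes the
		 * cache line that avg starts instead of opening a new one */
		printf("sizeof(avg)=%zu offsetof(util_est)=%zu end=%zu\n",
		       sizeof(struct fake_sched_avg),
		       offsetof(struct fake_sched_entity, util_est),
		       offsetof(struct fake_sched_entity, util_est) +
		       sizeof(struct fake_util_est));
		return 0;
	}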
On 13-Dec 17:16, Peter Zijlstra wrote:
> On Tue, Dec 05, 2017 at 05:10:16PM +0000, Patrick Bellasi wrote:
> > +static inline void util_est_dequeue(struct task_struct *p, int flags)
> > +{
> > +	struct cfs_rq *cfs_rq = &task_rq(p)->cfs;
> > +	unsigned long util_last = task_util(p);
> > +	bool sleep = flags & DEQUEUE_SLEEP;
> > +	unsigned long ewma;
> > +	long util_est;
> > +
> > +	if (!sched_feat(UTIL_EST))
> > +		return;
> > +
> > +	/*
> > +	 * Update root cfs_rq's estimated utilization
> > +	 *
> > +	 * If *p is the last task then the root cfs_rq's estimated utilization
> > +	 * of a CPU is 0 by definition.
> > +	 *
> > +	 * Otherwise, in removing *p's util_est from its cfs_rq's
> > +	 * util_est_runnable we should account for cases where this last
> > +	 * activation of *p was longer than the previous ones.
> > +	 * Also in these cases we need to set 0 the estimated utilization for
> > +	 * the CPU.
> > +	 */
> > +	if (cfs_rq->nr_running > 0) {
> > +		util_est  = cfs_rq->util_est_runnable;
> > +		util_est -= task_util_est(p);
> > +		if (util_est < 0)
> > +			util_est = 0;
> > +		cfs_rq->util_est_runnable = util_est;
> > +	} else {
> > +		cfs_rq->util_est_runnable = 0;
> > +	}
> > +
> > +	/*
> > +	 * Skip update of task's estimated utilization when the task has not
> > +	 * yet completed an activation, e.g. being migrated.
> > +	 */
> > +	if (!sleep)
> > +		return;
> > +
> > +	/*
> > +	 * Skip update of task's estimated utilization when its EWMA is already
> > +	 * ~1% close to its last activation value.
> > +	 */
> > +	util_est = p->util_est.ewma;
> > +	if (abs(util_est - util_last) <= (SCHED_CAPACITY_SCALE / 100))
> > +		return;
>
> Isn't that computation almost as expensive as the stuff you're trying to
> avoid?

Mmm... maybe slightly simpler. I'll profile it again but I remember
I've added it because it was slightly better on hackbench.

This code at the end is just a "sub" and a "compare to constant" and
it's likely to bail early for all "almost regular" tasks.

Are you worried about the branch overhead?

> > +	/*
> > +	 * Update Task's estimated utilization
> > +	 *
> > +	 * When *p completes an activation we can consolidate another sample
> > +	 * about the task size. This is done by storing the last PELT value
> > +	 * for this task and using this value to load another sample in the
> > +	 * exponential weighted moving average:
> > +	 *
> > +	 *      ewma(t) = w * task_util(p) + (1 - w) ewma(t-1)
> > +	 *              = w * task_util(p) + ewma(t-1) - w * ewma(t-1)
> > +	 *              = w * (task_util(p) + ewma(t-1) / w - ewma(t-1))
> > +	 *
> > +	 * Where 'w' is the weight of new samples, which is configured to be
> > +	 * 0.25, thus making w=1/4
> > +	 */
> > +	p->util_est.last = util_last;
> > +	ewma = p->util_est.ewma;
> > +	if (likely(ewma != 0)) {
>
> Why special case 0? Yes it helps with the initial ramp-on, but would not
> an asymmetric IIR (with a consistent upward bias) be better?

Yes, maybe the fast ramp-up is not really necessary... I'll test it
without on some real use-cases and see if we really get any noticeable
benefit, otherwise I'll remove it.

Thanks for pointing this out.

> > +		ewma = util_last + (ewma << UTIL_EST_WEIGHT_SHIFT) - ewma;
> > +		ewma >>= UTIL_EST_WEIGHT_SHIFT;
> > +	} else {
> > +		ewma = util_last;
> > +	}
> > +	p->util_est.ewma = ewma;
> > +}

--
#include <best/regards.h>

Patrick Bellasi
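For reference, a small userspace sketch (function and variable names are illustrative, the seed value is arbitrary) showing that the shift-based update quoted above is exactly the w = 1/4 EWMA from the comment, i.e. ewma(t) = (util_last + 3 * ewma(t-1)) / 4:

	#include <stdio.h>

	#define UTIL_EST_WEIGHT_SHIFT	2	/* w = 1 / (1 << 2) = 0.25 */

	/* same arithmetic as the quoted util_est_dequeue() update */
	static unsigned long ewma_update(unsigned long ewma, unsigned long util_last)
	{
		ewma   = util_last + (ewma << UTIL_EST_WEIGHT_SHIFT) - ewma;
		ewma >>= UTIL_EST_WEIGHT_SHIFT;
		return ewma;
	}

	int main(void)
	{
		/* seeded as the ewma == 0 special case in the patch would do */
		unsigned long ewma = 400;
		unsigned long samples[] = { 400, 400, 100, 100, 100 };

		for (unsigned int i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
			ewma = ewma_update(ewma, samples[i]);
			printf("sample=%4lu ewma=%4lu\n", samples[i], ewma);
		}
		return 0;
	}

Each new activation pulls the average a quarter of the way toward the latest sample, which is what makes the estimate relatively insensitive to transient changes in the task's behaviour.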
On Fri, Dec 15, 2017 at 12:03:31PM +0000, Patrick Bellasi wrote:
> So, by moving util_est right after sched_avg, here is what we get (with some
> lines to better highlight 64B boundaries):
>
>    const struct sched_class * sched_class;                                  /*  152   8 */
>    struct sched_entity {
> 	[...]
> ---[ Line 9 ]-------------------------------------------------------------------------------
> 	struct sched_avg {
> 		/* typedef u64 */ long long unsigned int last_update_time;  /*  576   8 */
> 		/* typedef u64 */ long long unsigned int load_sum;           /*  584   8 */
> 		/* typedef u64 */ long long unsigned int runnable_load_sum;  /*  592   8 */
> 		/* typedef u32 */ unsigned int util_sum;                     /*  600   4 */
> 		/* typedef u32 */ unsigned int period_contrib;               /*  604   4 */
> 		long unsigned int load_avg;                                  /*  608   8 */
> 		long unsigned int runnable_load_avg;                         /*  616   8 */
> 		long unsigned int util_avg;                                  /*  624   8 */
> 	} avg; /*  576  56 */
> 	/* --- cacheline 6 boundary (384 bytes) was 24 bytes ago --- */
> 	struct util_est {
> 		long unsigned int last;                                      /*  632   8 */
> ---[ Line 10 ]------------------------------------------------------------------------------
> 		long unsigned int ewma;                                      /*  640   8 */
> 	} util_est; /*  632  16 */
>    } se; /*  192  512 */
> ---[ Line 11 ]------------------------------------------------------------------------------
>    /* --- cacheline 9 boundary (576 bytes) was 24 bytes ago --- */
>    struct sched_rt_entity {
> 	struct list_head {
> 		struct list_head * next;                                     /*  704   8 */
> 		struct list_head * prev;                                     /*  712   8 */
> 	} run_list; /*  704  16 */
>
> As you can see, we still end up with util_est spanning across two cache lines
> and even worse with an almost empty Line 10. The point is that sched_avg
> already uses 56B... which leaves just 8 bytes left.

Yes, that's unfortunate.

> So, I can move util_est there and use unsigned int for "last" and "ewma"
> storage. This should fix the cache alignment but only as long as we do not
> add other stuff to sched_avg.
>
> BTW, shouldn't it be possible to use a similar "fasting" approach for
> load_avg and runnable_load_avg? Given their range a u32 should be just good
> enough, isn't it?

Probably, I'd have to page all that stuff back in :/

Another issue is that for tasks load and runnable_load are the exact
same; I just never found a sensible way to collapse that.
On Fri, Dec 15, 2017 at 02:02:18PM +0000, Patrick Bellasi wrote:
> On 13-Dec 17:05, Peter Zijlstra wrote:
> > On Tue, Dec 05, 2017 at 05:10:16PM +0000, Patrick Bellasi wrote:
> > > +	if (cfs_rq->nr_running > 0) {
> > > +		util_est  = cfs_rq->util_est_runnable;
> > > +		util_est -= task_util_est(p);
> > > +		if (util_est < 0)
> > > +			util_est = 0;
> > > +		cfs_rq->util_est_runnable = util_est;
> > > +	} else {
> >
> > I'm thinking that's an explicit load-store to avoid intermediate values
> > landing in cfs_rq->util_est_runnable, right?
>
> Was mainly to have an unsigned util_est for the following "sub"...
>
> > That would need READ_ONCE() / WRITE_ONCE() I think, without that the
> > compiler is free to munge the lot together.
>
> ... do we still need the {READ,WRITE}_ONCE() in this case?
> I guess adding them however does not hurt.

I think the compiler is free to munge it into something like:

	cfs_rq->util_est_runnable -= task_util_est();
	if (cfs_rq->util_est_runnable < 0)
		cfs_rq->util_est_runnable = 0

and it's a fairly simple optimization at that, it just needs to observe
that util_est is an alias for cfs_rq->util_est_runnable.

Using the volatile load/store completely destroys that optimization
though, so yes, I'd say it's definitely needed.
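A sketch of how the dequeue-side update could look with the suggested accessors (an illustration, not a posted revision; it reuses the identifiers from the patch under discussion):

	if (cfs_rq->nr_running > 0) {
		/* one explicit load... */
		util_est  = READ_ONCE(cfs_rq->util_est_runnable);
		util_est -= task_util_est(p);
		if (util_est < 0)
			util_est = 0;
		/* ...and one explicit store: the compiler can no longer fold
		 * the intermediate values into cfs_rq->util_est_runnable */
		WRITE_ONCE(cfs_rq->util_est_runnable, util_est);
	} else {
		WRITE_ONCE(cfs_rq->util_est_runnable, 0);
	}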
On 15-Dec 15:07, Peter Zijlstra wrote:
> On Fri, Dec 15, 2017 at 02:02:18PM +0000, Patrick Bellasi wrote:
> > On 13-Dec 17:05, Peter Zijlstra wrote:
> > > On Tue, Dec 05, 2017 at 05:10:16PM +0000, Patrick Bellasi wrote:
> > > > +	if (cfs_rq->nr_running > 0) {
> > > > +		util_est  = cfs_rq->util_est_runnable;
> > > > +		util_est -= task_util_est(p);
> > > > +		if (util_est < 0)
> > > > +			util_est = 0;
> > > > +		cfs_rq->util_est_runnable = util_est;
> > > > +	} else {
> > >
> > > I'm thinking that's an explicit load-store to avoid intermediate values
> > > landing in cfs_rq->util_est_runnable, right?
> >
> > Was mainly to have an unsigned util_est for the following "sub"...
> >
> > > That would need READ_ONCE() / WRITE_ONCE() I think, without that the
> > > compiler is free to munge the lot together.
> >
> > ... do we still need the {READ,WRITE}_ONCE() in this case?
> > I guess adding them however does not hurt.
>

This is just to better understand....

> I think the compiler is free to munge it into something like:
>
> 	cfs_rq->util_est_runnable -= task_util_est();
> 	if (cfs_rq->util_est_runnable < 0)
> 		cfs_rq->util_est_runnable = 0
>

I'm still confused, we have:

	long          util_est
	unsigned long cfs_rq->util_est_runnable

The optimization above can potentially generate an overflow, can't it?

> and it's a fairly simple optimization at that, it just needs to observe
> that util_est is an alias for cfs_rq->util_est_runnable.

Since the first is signed and the latter unsigned, can the compiler still
consider them an alias?

At least on ARM I would expect a load of an unsigned value, some
computations on "signed registers", and finally a store of an unsigned
value. This is what I get:

	   if (cfs_rq->nr_running > 0) {
	     51e4:	91020000 	add	x0, x0, #0x80
	     51e8:	b9401802 	ldr	w2, [x0,#24]
	     51ec:	340004a2 	cbz	w2, 5280 <dequeue_task_fair+0xb20>
				// skip branch for nr_running == 0

	           return max(p->util_est.ewma, p->util_est.last);
	     51f0:	f9403ba2 	ldr	x2, [x29,#112]
	     51f4:	f9418044 	ldr	x4, [x2,#768]
	     51f8:	f9418443 	ldr	x3, [x2,#776]
				// x3 := p->util_est.ewma
				// x4 := p->util_est.last

	           util_est -= task_util_est(p);
	     51fc:	f9405002 	ldr	x2, [x0,#160]
				// x2 := cfs_rq->util_est_runnable

	           return max(p->util_est.ewma, p->util_est.last);
	     5200:	eb04007f 	cmp	x3, x4
	     5204:	9a842063 	csel	x3, x3, x4, cs
				// x3 := task_util_est(p)
				//    := max(p->util_est.ewma, p->util_est.last)

	           cfs_rq->util_est_runnable = util_est;
	     5208:	eb030042 	subs	x2, x2, x3
				// x2 := util_est -= task_util_est(p);
	     520c:	9a9f5042 	csel	x2, x2, xzr, pl
				// x2 := max(0, util_est)
	     5210:	f9005002 	str	x2, [x0,#160]
				// store back into cfs_rq->util_est_runnable

And by adding {READ,WRITE}_ONCE I still get the same code.

While compiling for x86 I get two different versions; here is the one
without {READ,WRITE}_ONCE:

	   if (cfs_rq->nr_running > 0) {
	     3e3e:	8b 90 98 00 00 00    	mov    0x98(%rax),%edx
	     3e44:	85 d2                	test   %edx,%edx
	     3e46:	0f 84 e0 00 00 00    	je     3f2c <dequeue_task_fair+0xf9c>

	           util_est = cfs_rq->util_est_runnable;
	           util_est -= task_util_est(p);
	     3e4c:	48 8b 74 24 28       	mov    0x28(%rsp),%rsi
	     3e51:	48 8b 96 80 02 00 00 	mov    0x280(%rsi),%rdx
	     3e58:	48 39 96 88 02 00 00 	cmp    %rdx,0x288(%rsi)
	     3e5f:	48 0f 43 96 88 02 00 	cmovae 0x288(%rsi),%rdx
	     3e66:	00

	           if (util_est < 0)
	                   util_est = 0;
	           cfs_rq->util_est_runnable = util_est;
	     3e67:	48 8b b0 20 01 00 00 	mov    0x120(%rax),%rsi
	     3e6e:	48 29 d6             	sub    %rdx,%rsi
	     3e71:	48 89 f2             	mov    %rsi,%rdx
	     3e74:	be 00 00 00 00       	mov    $0x0,%esi
	     3e79:	48 0f 48 d6          	cmovs  %rsi,%rdx
	     3e7d:	48 89 90 20 01 00 00 	mov    %rdx,0x120(%rax)

but I'm not confident in parsing it...
> Using the volatile load/store completely destroys that optimization
> though, so yes, I'd say it's definitely needed.

Ok, since it's definitely not harmful, I'll add it.

--
#include <best/regards.h>

Patrick Bellasi
On 15-Dec 13:53, Peter Zijlstra wrote:
> On Fri, Dec 15, 2017 at 12:14:17PM +0000, Patrick Bellasi wrote:
> > On 13-Dec 17:16, Peter Zijlstra wrote:
>
> > > > +	/*
> > > > +	 * Skip update of task's estimated utilization when its EWMA is already
> > > > +	 * ~1% close to its last activation value.
> > > > +	 */
> > > > +	util_est = p->util_est.ewma;
> > > > +	if (abs(util_est - util_last) <= (SCHED_CAPACITY_SCALE / 100))
> > > > +		return;
> > >
> > > Isn't that computation almost as expensive as the stuff you're trying to
> > > avoid?
> >
> > Mmm... maybe slightly simpler. I'll profile it again but I remember
> > I've added it because it was slightly better on hackbench.
> >
> > This code at the end is just a "sub" and a "compare to constant" and
> > it's likely to bail early for all "almost regular" tasks.
> >
> > Are you worried about the branch overhead?
>
> Its a subtract, a test for sign, a conditional branch on test, a negate,
> a subtract constant and another conditional branch.

Close enough, the actual code is:

	   util_est = p->util_est.ewma;
	     5218:	f9403ba3 	ldr	x3, [x29,#112]
	     521c:	f9418462 	ldr	x2, [x3,#776]
	   if (abs(util_est - util_last) <= (SCHED_CAPACITY_SCALE / 100))
	     5220:	eb010040 	subs	x0, x2, x1
	     5224:	da805400 	cneg	x0, x0, mi
	     5228:	f100281f 	cmp	x0, #0xa
	     522c:	54fff9cd 	b.le	5164 <dequeue_task_fair+0xa04>

>
> Branch overhead certainly matters too.
>
> > > > +	p->util_est.last = util_last;
> > > > +	ewma = p->util_est.ewma;
> > > > +	if (likely(ewma != 0)) {
> > >
> > > Why special case 0? Yes it helps with the initial ramp-on, but would not
> > > an asymmetric IIR (with a consistent upward bias) be better?
> >
> > Yes, maybe the fast ramp-up is not really necessary... I'll test it
> > without on some real use-cases and see if we really get any noticeable
> > benefit, otherwise I'll remove it.
> >
> > Thanks for pointing this out.
> >
> > > > +		ewma = util_last + (ewma << UTIL_EST_WEIGHT_SHIFT) - ewma;
> > > > +		ewma >>= UTIL_EST_WEIGHT_SHIFT;
> > > > +	} else {
> > > > +		ewma = util_last;
> > > > +	}
> > > > +	p->util_est.ewma = ewma;
>
> And this, without the 0 case, is shift, an add, a subtract and another
> shift followed by a store.

Actual code:

	   p->util_est.last = util_last;
	     5230:	f9018061 	str	x1, [x3,#768]
	   if (likely(ewma != 0)) {
	     5234:	b40000a2 	cbz	x2, 5248 <dequeue_task_fair+0xae8>
	   ewma = util_last + (ewma << UTIL_EST_WEIGHT_SHIFT) - ewma;
	     5238:	d37ef440 	lsl	x0, x2, #2
	     523c:	cb020002 	sub	x2, x0, x2
	     5240:	8b010041 	add	x1, x2, x1
	   ewma >>= UTIL_EST_WEIGHT_SHIFT;
	     5244:	d342fc21 	lsr	x1, x1, #2
	   p->util_est.ewma = ewma;
	     5248:	f9403ba0 	ldr	x0, [x29,#112]
	     524c:	f9018401 	str	x1, [x0,#776]
	     5250:	17ffffc5 	b	5164 <dequeue_task_fair+0xa04>

> Which is less branches and roughly similar arith ops, some of which can
> be done in parallel.
>
> I suspect what you saw on the profile is the cacheline hit of the store,
> but I'm not sure.

Yes likely, looking at the two chunks above and considering the removal of
the 0 case, it's probably worth removing the first check.

I'll give it a try again to measure hackbench overheads with the cache
alignment fixed.

Cheers Patrick

--
#include <best/regards.h>

Patrick Bellasi
On Fri, Dec 15, 2017 at 03:41:40PM +0000, Patrick Bellasi wrote:
> Close enough, the actual code is:
>
> 	   util_est = p->util_est.ewma;
> 	     5218:	f9403ba3 	ldr	x3, [x29,#112]
> 	     521c:	f9418462 	ldr	x2, [x3,#776]
> 	   if (abs(util_est - util_last) <= (SCHED_CAPACITY_SCALE / 100))
> 	     5220:	eb010040 	subs	x0, x2, x1
> 	     5224:	da805400 	cneg	x0, x0, mi
> 	     5228:	f100281f 	cmp	x0, #0xa
> 	     522c:	54fff9cd 	b.le	5164 <dequeue_task_fair+0xa04>

Ah, that cneg instruction is cute; on x86 we end up with something like:

	bool abs_test(long s)
	{
		return abs(s) < 32;
	}

	cmpl	$-31, %eax
	jl	.L107
	movq	-8(%rbp), %rax
	cmpl	$31, %eax
	jg	.L107
	movl	$1, %eax
	jmp	.L108
.L107:
	movl	$0, %eax
.L108:

But I figured you can actually do:

	abs(x) < y := (unsigned)(x + y - 1) < (2 * y - 1)

Which, if y is a constant, should result in nicer code, and it does for
x86:

	addq	$31, %rax
	cmpq	$62, %rax
	setbe	%al
	movzbl	%al, %eax

Just not measurably faster, I suppose because of all the dependencies :/
On Wed, Dec 20, 2017 at 09:57:47AM +0100, Peter Zijlstra wrote:
> On Fri, Dec 15, 2017 at 03:41:40PM +0000, Patrick Bellasi wrote:
> > Close enough, the actual code is:
> >
> > 	   util_est = p->util_est.ewma;
> > 	     5218:	f9403ba3 	ldr	x3, [x29,#112]
> > 	     521c:	f9418462 	ldr	x2, [x3,#776]
> > 	   if (abs(util_est - util_last) <= (SCHED_CAPACITY_SCALE / 100))
> > 	     5220:	eb010040 	subs	x0, x2, x1
> > 	     5224:	da805400 	cneg	x0, x0, mi
> > 	     5228:	f100281f 	cmp	x0, #0xa
> > 	     522c:	54fff9cd 	b.le	5164 <dequeue_task_fair+0xa04>
>
> Ah, that cneg instruction is cute; on x86 we end up with something like:
>
> 	bool abs_test(long s)
> 	{
> 		return abs(s) < 32;
> 	}
>
> 	cmpl	$-31, %eax
> 	jl	.L107
> 	movq	-8(%rbp), %rax
> 	cmpl	$31, %eax
> 	jg	.L107
> 	movl	$1, %eax
> 	jmp	.L108
> .L107:
> 	movl	$0, %eax
> .L108:
>
> But I figured you can actually do:
>
> 	abs(x) < y := (unsigned)(x + y - 1) < (2 * y - 1)
>
> Which, if y is a constant, should result in nicer code, and it does for
> x86:
>
> 	addq	$31, %rax
> 	cmpq	$62, %rax
> 	setbe	%al
> 	movzbl	%al, %eax
>
> Just not measurably faster, I suppose because of all the dependencies :/

Ah no, it actually is; I'm an idiot and used 'long' for the return value.
If I use bool we lose that last movzbl and we go from around 4.0 cycles
down to 3.4 cycles.
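For reference, a self-contained sketch of that identity applied to the ~1% check from the patch (the helper name is made up for illustration; this is the non-strict variant, i.e. abs(x) <= y becomes (unsigned)(x + y) <= 2 * y for small, non-negative y):

	#include <stdbool.h>
	#include <stdio.h>
	#include <stdlib.h>

	#define SCHED_CAPACITY_SCALE	1024
	#define MARGIN			(SCHED_CAPACITY_SCALE / 100)	/* 10 */

	/* branchless abs(diff) <= MARGIN */
	static bool within_margin(long diff)
	{
		return (unsigned long)(diff + MARGIN) <= 2 * MARGIN;
	}

	int main(void)
	{
		/* both columns agree for every value */
		for (long d = -12; d <= 12; d += 3)
			printf("%3ld: branchless=%d abs=%d\n",
			       d, within_margin(d), labs(d) <= MARGIN);
		return 0;
	}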
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 21991d668d35..b01c0dc75ef5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -338,6 +338,21 @@ struct sched_avg {
 	unsigned long util_avg;
 };
 
+/**
+ * Estimated utilization for FAIR tasks.
+ *
+ * Support data structure to track an Exponential Weighted Moving Average
+ * (EWMA) of a FAIR task's utilization. New samples are added to the moving
+ * average each time a task completes an activation. Sample's weight is
+ * chosen so that the EWMA will be relatively insensitive to transient changes
+ * to the task's workload.
+ */
+struct util_est {
+	unsigned long last;
+	unsigned long ewma;
+#define UTIL_EST_WEIGHT_SHIFT 2
+};
+
 struct sched_statistics {
 #ifdef CONFIG_SCHEDSTATS
 	u64 wait_start;
@@ -562,6 +577,12 @@ struct task_struct {
 
 	const struct sched_class *sched_class;
 	struct sched_entity se;
+	/*
+	 * Since we use se.avg.util_avg to update util_est fields,
+	 * this last can benefit from being close to se which
+	 * also defines se.avg as cache aligned.
+	 */
+	struct util_est util_est;
 	struct sched_rt_entity rt;
 #ifdef CONFIG_CGROUP_SCHED
 	struct task_group *sched_task_group;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1ca0130ed4f9..5ffa8234524a 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -567,6 +567,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 			cfs_rq->avg.runnable_load_avg);
 	SEQ_printf(m, "  .%-30s: %lu\n", "util_avg",
 			cfs_rq->avg.util_avg);
+	SEQ_printf(m, "  .%-30s: %lu\n", "util_est_runnable",
+			cfs_rq->util_est_runnable);
 	SEQ_printf(m, "  .%-30s: %ld\n", "removed.load_avg",
 			cfs_rq->removed.load_avg);
 	SEQ_printf(m, "  .%-30s: %ld\n", "removed.util_avg",
@@ -1018,6 +1020,8 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 	P(se.avg.runnable_load_avg);
 	P(se.avg.util_avg);
 	P(se.avg.last_update_time);
+	P(util_est.ewma);
+	P(util_est.last);
 #endif
 	P(policy);
 	P(prio);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ad21550d008c..d8f3ed71010b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -732,6 +732,12 @@ void init_entity_runnable_average(struct sched_entity *se)
 		se->runnable_weight = se->load.weight;
 
 	/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
+
+	/* Utilization estimation */
+	if (entity_is_task(se)) {
+		task_of(se)->util_est.ewma = 0;
+		task_of(se)->util_est.last = 0;
+	}
 }
 
 static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
@@ -5153,6 +5159,20 @@ static inline void hrtick_update(struct rq *rq)
 }
 #endif
 
+static inline unsigned long task_util(struct task_struct *p);
+static inline unsigned long task_util_est(struct task_struct *p);
+
+static inline void util_est_enqueue(struct task_struct *p)
+{
+	struct cfs_rq *cfs_rq = &task_rq(p)->cfs;
+
+	if (!sched_feat(UTIL_EST))
+		return;
+
+	/* Update root cfs_rq's estimated utilization */
+	cfs_rq->util_est_runnable += task_util_est(p);
+}
+
 /*
  * The enqueue_task method is called before nr_running is
  * increased. Here we update the fair scheduling stats and
@@ -5205,9 +5225,84 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	if (!se)
 		add_nr_running(rq, 1);
 
+	util_est_enqueue(p);
 	hrtick_update(rq);
 }
 
+static inline void util_est_dequeue(struct task_struct *p, int flags)
+{
+	struct cfs_rq *cfs_rq = &task_rq(p)->cfs;
+	unsigned long util_last = task_util(p);
+	bool sleep = flags & DEQUEUE_SLEEP;
+	unsigned long ewma;
+	long util_est;
+
+	if (!sched_feat(UTIL_EST))
+		return;
+
+	/*
+	 * Update root cfs_rq's estimated utilization
+	 *
+	 * If *p is the last task then the root cfs_rq's estimated utilization
+	 * of a CPU is 0 by definition.
+	 *
+	 * Otherwise, in removing *p's util_est from its cfs_rq's
+	 * util_est_runnable we should account for cases where this last
+	 * activation of *p was longer than the previous ones.
+	 * Also in these cases we need to set 0 the estimated utilization for
+	 * the CPU.
+	 */
+	if (cfs_rq->nr_running > 0) {
+		util_est  = cfs_rq->util_est_runnable;
+		util_est -= task_util_est(p);
+		if (util_est < 0)
+			util_est = 0;
+		cfs_rq->util_est_runnable = util_est;
+	} else {
+		cfs_rq->util_est_runnable = 0;
+	}
+
+	/*
+	 * Skip update of task's estimated utilization when the task has not
+	 * yet completed an activation, e.g. being migrated.
+	 */
+	if (!sleep)
+		return;
+
+	/*
+	 * Skip update of task's estimated utilization when its EWMA is already
+	 * ~1% close to its last activation value.
+	 */
+	util_est = p->util_est.ewma;
+	if (abs(util_est - util_last) <= (SCHED_CAPACITY_SCALE / 100))
+		return;
+
+	/*
+	 * Update Task's estimated utilization
+	 *
+	 * When *p completes an activation we can consolidate another sample
+	 * about the task size. This is done by storing the last PELT value
+	 * for this task and using this value to load another sample in the
+	 * exponential weighted moving average:
+	 *
+	 *      ewma(t) = w * task_util(p) + (1 - w) ewma(t-1)
+	 *              = w * task_util(p) + ewma(t-1) - w * ewma(t-1)
+	 *              = w * (task_util(p) + ewma(t-1) / w - ewma(t-1))
+	 *
+	 * Where 'w' is the weight of new samples, which is configured to be
+	 * 0.25, thus making w=1/4
+	 */
+	p->util_est.last = util_last;
+	ewma = p->util_est.ewma;
+	if (likely(ewma != 0)) {
+		ewma = util_last + (ewma << UTIL_EST_WEIGHT_SHIFT) - ewma;
+		ewma >>= UTIL_EST_WEIGHT_SHIFT;
+	} else {
+		ewma = util_last;
+	}
+	p->util_est.ewma = ewma;
+}
+
 static void set_next_buddy(struct sched_entity *se);
 
 /*
@@ -5264,6 +5359,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	if (!se)
 		sub_nr_running(rq, 1);
 
+	util_est_dequeue(p, flags);
 	hrtick_update(rq);
 }
 
@@ -5721,7 +5817,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 	return affine;
 }
 
-static inline unsigned long task_util(struct task_struct *p);
 static unsigned long cpu_util_wake(int cpu, struct task_struct *p);
 
 static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
@@ -6216,6 +6311,11 @@ static inline unsigned long task_util(struct task_struct *p)
 	return p->se.avg.util_avg;
 }
 
+static inline unsigned long task_util_est(struct task_struct *p)
+{
+	return max(p->util_est.ewma, p->util_est.last);
+}
+
 /*
  * cpu_util_wake: Compute cpu utilization with any contributions from
  * the waking task p removed.
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 9552fd5854bf..e9f312acc0d3 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -85,3 +85,8 @@ SCHED_FEAT(ATTACH_AGE_LOAD, true)
 SCHED_FEAT(WA_IDLE, true)
 SCHED_FEAT(WA_WEIGHT, true)
 SCHED_FEAT(WA_BIAS, true)
+
+/*
+ * UtilEstimation. Use estimated CPU utilization.
+ */
+SCHED_FEAT(UTIL_EST, false)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b19552a212de..8371839075fa 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -444,6 +444,7 @@ struct cfs_rq {
 	 * CFS load tracking
 	 */
 	struct sched_avg avg;
+	unsigned long util_est_runnable;
 #ifndef CONFIG_64BIT
 	u64 load_last_update_time_copy;
 #endif