Message ID | 1413958051-7103-7-git-send-email-mturquette@linaro.org
---|---
State | New |
Hi Mike,

On 10/22/2014 11:37 AM, Mike Turquette wrote:
> {en,de}queue_task_fair are updated to track which cpus will have changed
> utilization values as a function of task queueing. The affected cpus are
> passed on to arch_eval_cpu_freq for further machine-specific processing
> based on a selectable policy.
>
> arch_scale_cpu_freq is called from run_rebalance_domains as a way to
> kick off the scaling process (via wake_up_process), so as to prevent
> re-entering the {en,de}queue code.
>
> All of the call sites in this patch are up for discussion. Does it make
> sense to track which cpus have updated statistics in enqueue_fair_task?
> I chose this because I wanted to gather statistics for all cpus affected
> in the event CONFIG_FAIR_GROUP_SCHED is enabled. As agreed at LPC14 the

Can you explain how pstate selection can get affected by the presence of task groups? We are, after all, concerned with the cpu load. So when we enqueue/dequeue a task, we update the cpu load and pass it on for cpu pstate scaling. How does this change if we have task groups? I know that this issue was brought up during LPC, but I have not yet managed to gain clarity here.

> next version of this patch will focus on the simpler case of not using
> scheduler cgroups, which should remove a good chunk of this code,
> including the cpumask stuff.
>
> Also discussed at LPC14 is the fact that load_balance is a very
> interesting place to do this as frequency can be considered in concert
> with task placement. Please put forth any ideas on a sensible way to do
> this.
>
> Is run_rebalance_domains a logical place to change cpu frequency? What
> other call sites make sense?
>
> Even for platforms that can target a cpu frequency without sleeping
> (x86, some ARM platforms with PM microcontrollers) it is currently
> necessary to always kick the frequency target work out into a kthread.
> This is because of the rw_sem usage in the cpufreq core which might
> sleep. Replacing that lock type is probably a good idea.
>
> Not-signed-off-by: Mike Turquette <mturquette@linaro.org>
> ---
>  kernel/sched/fair.c | 39 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 39 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1af6f6d..3619f63 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3999,6 +3999,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  {
>  	struct cfs_rq *cfs_rq;
>  	struct sched_entity *se = &p->se;
> +	struct cpumask update_cpus;
> +
> +	cpumask_clear(&update_cpus);
>
>  	for_each_sched_entity(se) {
>  		if (se->on_rq)
> @@ -4028,12 +4031,27 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>
>  		update_cfs_shares(cfs_rq);
>  		update_entity_load_avg(se, 1);
> +		/* track cpus that need to be re-evaluated */
> +		cpumask_set_cpu(cpu_of(rq_of(cfs_rq)), &update_cpus);

All the cfs_rqs that you iterate through here will belong to the same rq/cpu, right?

Regards
Preeti U Murthy
On Tue, Oct 21, 2014 at 11:07:30PM -0700, Mike Turquette wrote:
> {en,de}queue_task_fair are updated to track which cpus will have changed
> utilization values as a function of task queueing. The affected cpus are
> passed on to arch_eval_cpu_freq for further machine-specific processing
> based on a selectable policy.

Yeah, I'm not sure about the arch eval hook; ideally it'd all be integrated with the energy model.

> arch_scale_cpu_freq is called from run_rebalance_domains as a way to
> kick off the scaling process (via wake_up_process), so as to prevent
> re-entering the {en,de}queue code.

We might want a better name for that :-) dvfs_set_freq() or whatnot, or maybe preserve the cpufreq_*() namespace; people seem to know that that is the linux dvfs name.

> All of the call sites in this patch are up for discussion. Does it make
> sense to track which cpus have updated statistics in enqueue_fair_task?

Like I said, I don't think so; we guestimate and approximate everything anyhow. Don't bother trying to be 'perfect' here, it's excessively expensive.

> I chose this because I wanted to gather statistics for all cpus affected
> in the event CONFIG_FAIR_GROUP_SCHED is enabled. As agreed at LPC14 the
> next version of this patch will focus on the simpler case of not using
> scheduler cgroups, which should remove a good chunk of this code,
> including the cpumask stuff.

Yes please, make the cpumask stuff go away :-)

> Also discussed at LPC14 is the fact that load_balance is a very
> interesting place to do this as frequency can be considered in concert
> with task placement. Please put forth any ideas on a sensible way to do
> this.

Ideally it'd be a natural fallout of Morten's energy model. If you take a multi-core energy model, find its bifurcations and map its solution spaces, I suspect there to be a fairly small set of actual behaviours. The problem is, nobody seems to have done this yet, so we don't know.

Once you've done this, you can try and minimize the model by proving you retain all behaviour modes, but for now Morten has a rather full parameter space (not complete though, and the impact of the missing parameters might or might not be relevant; impossible to prove until we have the above done).

> Is run_rebalance_domains a logical place to change cpu frequency? What
> other call sites make sense?

For the legacy systems, maybe.

> Even for platforms that can target a cpu frequency without sleeping
> (x86, some ARM platforms with PM microcontrollers) it is currently
> necessary to always kick the frequency target work out into a kthread.
> This is because of the rw_sem usage in the cpufreq core which might
> sleep. Replacing that lock type is probably a good idea.

I think it would be best to start with this. Ideally we'd be able to RCU-free the thing such that either holding the rwsem or rcu_read_lock is sufficient for usage; that way the sleeping muck can grab the rwsem and the non-sleeping stuff can grab rcu_read_lock. But I've not looked at the cpufreq stuff at all.
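[Editor's note: a minimal sketch of the copy/publish-plus-RCU shape Peter describes, using a simplified stand-in for the real cpufreq policy object. struct freq_policy, freq_policy_update() and freq_policy_get_cur() are made-up names, not the existing cpufreq API; only the locking pattern is the point here.]

#include <linux/errno.h>
#include <linux/rcupdate.h>
#include <linux/rwsem.h>
#include <linux/slab.h>

/* Illustrative stand-in for the real struct cpufreq_policy. */
struct freq_policy {
	unsigned int cur;
	struct rcu_head rcu;
};

static struct freq_policy __rcu *the_policy;
static DECLARE_RWSEM(policy_rwsem);

/*
 * Sleeping (writer) path: copy, modify, publish; the old copy is freed
 * only after a grace period, so non-sleeping readers never see it vanish.
 */
static int freq_policy_update(unsigned int new_cur)
{
	struct freq_policy *old, *new;

	new = kmalloc(sizeof(*new), GFP_KERNEL);
	if (!new)
		return -ENOMEM;

	down_write(&policy_rwsem);
	old = rcu_dereference_protected(the_policy,
					lockdep_is_held(&policy_rwsem));
	if (old)
		*new = *old;
	new->cur = new_cur;
	rcu_assign_pointer(the_policy, new);
	up_write(&policy_rwsem);

	if (old)
		kfree_rcu(old, rcu);
	return 0;
}

/* Non-sleeping path (e.g. scheduler context): read-only snapshot under RCU. */
static unsigned int freq_policy_get_cur(void)
{
	struct freq_policy *p;
	unsigned int cur;

	rcu_read_lock();
	p = rcu_dereference(the_policy);
	cur = p ? p->cur : 0;
	rcu_read_unlock();

	return cur;
}

Whether the real cpufreq core can be restructured this way is exactly the open question raised above; the sketch only shows the locking shape, not the actual cpufreq data model.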
Hi Mike,

On 22/10/14 07:07, Mike Turquette wrote:
> {en,de}queue_task_fair are updated to track which cpus will have changed
> utilization values as a function of task queueing.

The sentence is a little bit misleading. We update the se utilization contrib and the cfs_rq utilization in {en,de}queue_task_fair for a specific se and a specific cpu = rq_of(cfs_rq_of(se))->cpu.

> The affected cpus are
> passed on to arch_eval_cpu_freq for further machine-specific processing
> based on a selectable policy.

I'm not sure if separating the evaluation and the setting of the cpu frequency makes sense. You could evaluate and possibly set the cpu frequency in one go. Right now you evaluate whether the cfs_rq utilization exceeds the thresholds for the current index every time a task is enqueued or dequeued, but that's not necessary since you only try to set the cpu frequency in the softirq. The history (and the future, if we consider blocked utilization) is already captured in the cfs_rq utilization itself.

> arch_scale_cpu_freq is called from run_rebalance_domains as a way to
> kick off the scaling process (via wake_up_process), so as to prevent
> re-entering the {en,de}queue code.

The name is misleading from the viewpoint of the CFS sched class. The original scaling functions of the CFS scheduler (arch_scale_{freq,smt/cpu,rt}_capacity) scale capacity based on frequency, uarch or rt. So your function should be called arch_scale_util_cpu_freq or, even better, arch_set_cpu_freq.

> All of the call sites in this patch are up for discussion. Does it make
> sense to track which cpus have updated statistics in enqueue_fair_task?

Not really, because cfs_rq utilization tracks the history/(future) of cpu utilization and you can evaluate the signal when you want to set the cpu frequency.

> I chose this because I wanted to gather statistics for all cpus affected
> in the event CONFIG_FAIR_GROUP_SCHED is enabled. As agreed at LPC14 the
> next version of this patch will focus on the simpler case of not using
> scheduler cgroups, which should remove a good chunk of this code,
> including the cpumask stuff.

I don't understand why you should care about task groups at all. The task groups' contribution to the utilization of a cpu should already be accounted for in the appropriate cpu's cfs_rq utilization signal.

But I can see a dependency on the fact that there is a difference between systems with per-cluster (package) and per-cpu frequency scaling. There is no SD_SHARE_FREQDOMAIN (sched domain flag) today which, applied to the MC sd level, could tell you that we are dealing with per-cluster frequency scaling, though.

On systems with per-cpu frequency scaling you can set the frequency for individual cpus by hooking into the scheduler, but on systems with per-cluster frequency scaling you would have to respect the maximum cpu utilization of all cpus in the cluster. A similar problem occurs with hardware threads (SMT sd level). But I don't know right now how the sd topology hierarchy could come in handy here.

> Also discussed at LPC14 is the fact that load_balance is a very
> interesting place to do this as frequency can be considered in concert
> with task placement. Please put forth any ideas on a sensible way to do
> this.
>
> Is run_rebalance_domains a logical place to change cpu frequency? What
> other call sites make sense?

At least it's a good place to test this feature for now.
>
> Even for platforms that can target a cpu frequency without sleeping
> (x86, some ARM platforms with PM microcontrollers) it is currently
> necessary to always kick the frequency target work out into a kthread.
> This is because of the rw_sem usage in the cpufreq core which might
> sleep. Replacing that lock type is probably a good idea.
>
> Not-signed-off-by: Mike Turquette <mturquette@linaro.org>
> ---
>  kernel/sched/fair.c | 39 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 39 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1af6f6d..3619f63 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3999,6 +3999,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  {
>  	struct cfs_rq *cfs_rq;
>  	struct sched_entity *se = &p->se;
> +	struct cpumask update_cpus;
> +
> +	cpumask_clear(&update_cpus);
>
>  	for_each_sched_entity(se) {
>  		if (se->on_rq)
> @@ -4028,12 +4031,27 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>
>  		update_cfs_shares(cfs_rq);
>  		update_entity_load_avg(se, 1);
> +		/* track cpus that need to be re-evaluated */
> +		cpumask_set_cpu(cpu_of(rq_of(cfs_rq)), &update_cpus);
>  	}
>
> +	/* !CONFIG_FAIR_GROUP_SCHED */
>  	if (!se) {
>  		update_rq_runnable_avg(rq, rq->nr_running);
>  		add_nr_running(rq, 1);
> +
> +		/*
> +		 * FIXME for !CONFIG_FAIR_GROUP_SCHED it might be nice to
> +		 * typedef update_cpus into an int and skip all of the cpumask
> +		 * stuff
> +		 */
> +		cpumask_set_cpu(cpu_of(rq), &update_cpus);
>  	}
> +
> +	if (energy_aware())
> +		if (!cpumask_empty(&update_cpus))
> +			arch_eval_cpu_freq(&update_cpus);
> +
>  	hrtick_update(rq);
>  }
>
> @@ -4049,6 +4067,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  	struct cfs_rq *cfs_rq;
>  	struct sched_entity *se = &p->se;
>  	int task_sleep = flags & DEQUEUE_SLEEP;
> +	struct cpumask update_cpus;
> +
> +	cpumask_clear(&update_cpus);
>
>  	for_each_sched_entity(se) {
>  		cfs_rq = cfs_rq_of(se);
> @@ -4089,12 +4110,27 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>
>  		update_cfs_shares(cfs_rq);
>  		update_entity_load_avg(se, 1);
> +		/* track runqueues/cpus that need to be re-evaluated */
> +		cpumask_set_cpu(cpu_of(rq_of(cfs_rq)), &update_cpus);
>  	}
>
> +	/* !CONFIG_FAIR_GROUP_SCHED */
>  	if (!se) {
>  		sub_nr_running(rq, 1);
>  		update_rq_runnable_avg(rq, 1);
> +
> +		/*
> +		 * FIXME for !CONFIG_FAIR_GROUP_SCHED it might be nice to
> +		 * typedef update_cpus into an int and skip all of the cpumask
> +		 * stuff
> +		 */
> +		cpumask_set_cpu(cpu_of(rq), &update_cpus);
>  	}
> +
> +	if (energy_aware())
> +		if (!cpumask_empty(&update_cpus))
> +			arch_eval_cpu_freq(&update_cpus);
> +
>  	hrtick_update(rq);
>  }
>
> @@ -7536,6 +7572,9 @@ static void run_rebalance_domains(struct softirq_action *h)
>  	 * stopped.
>  	 */
>  	nohz_idle_balance(this_rq, idle);
> +
> +	if (energy_aware())
> +		arch_scale_cpu_freq();
>  }
>
>  /*
>
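[Editor's note: a rough, hypothetical sketch of the per-cluster constraint Dietmar describes, where all cpus in a cluster share one frequency domain. freq_domain_mask(), cpu_util() and set_cluster_freq() are made-up helpers standing in for whatever the eventual interface would be; they are not existing kernel functions.]

/*
 * Hypothetical per-cluster evaluation: when a whole cluster shares one
 * frequency domain, the request must follow the busiest cpu in that
 * domain, not just the cpu whose utilization changed.
 */
static void eval_freq_domain(int trigger_cpu)
{
	/* made-up helper: cpus sharing a frequency domain with trigger_cpu */
	const struct cpumask *fd_mask = freq_domain_mask(trigger_cpu);
	unsigned long max_util = 0;
	int cpu;

	for_each_cpu(cpu, fd_mask)
		max_util = max(max_util, cpu_util(cpu)); /* made-up per-cpu utilization */

	/* One frequency request for the whole cluster. */
	set_cluster_freq(trigger_cpu, max_util); /* made-up helper */
}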
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1af6f6d..3619f63 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3999,6 +3999,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;
+	struct cpumask update_cpus;
+
+	cpumask_clear(&update_cpus);
 
 	for_each_sched_entity(se) {
 		if (se->on_rq)
@@ -4028,12 +4031,27 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 		update_cfs_shares(cfs_rq);
 		update_entity_load_avg(se, 1);
+		/* track cpus that need to be re-evaluated */
+		cpumask_set_cpu(cpu_of(rq_of(cfs_rq)), &update_cpus);
 	}
 
+	/* !CONFIG_FAIR_GROUP_SCHED */
 	if (!se) {
 		update_rq_runnable_avg(rq, rq->nr_running);
 		add_nr_running(rq, 1);
+
+		/*
+		 * FIXME for !CONFIG_FAIR_GROUP_SCHED it might be nice to
+		 * typedef update_cpus into an int and skip all of the cpumask
+		 * stuff
+		 */
+		cpumask_set_cpu(cpu_of(rq), &update_cpus);
 	}
+
+	if (energy_aware())
+		if (!cpumask_empty(&update_cpus))
+			arch_eval_cpu_freq(&update_cpus);
+
 	hrtick_update(rq);
 }
 
@@ -4049,6 +4067,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;
 	int task_sleep = flags & DEQUEUE_SLEEP;
+	struct cpumask update_cpus;
+
+	cpumask_clear(&update_cpus);
 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
@@ -4089,12 +4110,27 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 		update_cfs_shares(cfs_rq);
 		update_entity_load_avg(se, 1);
+		/* track runqueues/cpus that need to be re-evaluated */
+		cpumask_set_cpu(cpu_of(rq_of(cfs_rq)), &update_cpus);
 	}
 
+	/* !CONFIG_FAIR_GROUP_SCHED */
 	if (!se) {
 		sub_nr_running(rq, 1);
 		update_rq_runnable_avg(rq, 1);
+
+		/*
+		 * FIXME for !CONFIG_FAIR_GROUP_SCHED it might be nice to
+		 * typedef update_cpus into an int and skip all of the cpumask
+		 * stuff
+		 */
+		cpumask_set_cpu(cpu_of(rq), &update_cpus);
 	}
+
+	if (energy_aware())
+		if (!cpumask_empty(&update_cpus))
+			arch_eval_cpu_freq(&update_cpus);
+
 	hrtick_update(rq);
 }
 
@@ -7536,6 +7572,9 @@ static void run_rebalance_domains(struct softirq_action *h)
 	 * stopped.
 	 */
 	nohz_idle_balance(this_rq, idle);
+
+	if (energy_aware())
+		arch_scale_cpu_freq();
 }
 
 /*
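[Editor's note: a rough illustration of the simplification the FIXME comments point at, and that the cover text promises for the next version. Without task groups, every cfs_rq touched in the enqueue/dequeue loop belongs to the current rq's cpu, so a single flag can replace the on-stack cpumask. arch_eval_cpu_freq_single() is a made-up per-cpu counterpart of the patch's arch_eval_cpu_freq(); it does not exist in the posted code.]

/*
 * Sketch only: hypothetical !CONFIG_FAIR_GROUP_SCHED variant of the hook
 * at the end of {en,de}queue_task_fair(). The cpumask_clear()/
 * cpumask_set_cpu() bookkeeping collapses into one boolean.
 */
static inline void eval_cpu_freq(struct rq *rq, bool stats_updated)
{
	if (energy_aware() && stats_updated)
		arch_eval_cpu_freq_single(cpu_of(rq));	/* made-up helper */
}

enqueue_task_fair()/dequeue_task_fair() would then simply set stats_updated in their loops and call this once just before hrtick_update(rq).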