[0/6] cpufreq: schedutil: fixes for flags updates

Message ID	1488469507-32463-1-git-send-email-patrick.bellasi@arm.com
Headers	show Delivered-To: patch@linaro.org Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; From: Patrick Bellasi <patrick.bellasi@arm.com> To: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org, Ingo Molnar <mingo@redhat.com>, Peter Zijlstra <peterz@infradead.org>, "Rafael J . Wysocki" <rafael.j.wysocki@intel.com> Cc: Patrick Bellasi <patrick.bellasi@arm.com>, Viresh Kumar <viresh.kumar@linaro.org>, Steven Rostedt <rostedt@goodmis.org>, Vincent Guittot <vincent.guittot@linaro.org>, John Stultz <john.stultz@linaro.org>, Juri Lelli <juri.lelli@arm.com>, Todd Kjos <tkjos@android.com>, Tim Murray <timmurray@google.com>, Andres Oportus <andresoportus@google.com>, Joel Fernandes <joelaf@google.com>, Morten Rasmussen <morten.rasmussen@arm.com>, Dietmar Eggemann <dietmar.eggemann@arm.com>, Chris Redpath <chris.redpath@arm.com> Subject: [PATCH 0/6] cpufreq: schedutil: fixes for flags updates Date: Thu, 2 Mar 2017 15:45:01 +0000 Message-Id: <1488469507-32463-1-git-send-email-patrick.bellasi@arm.com> Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk
Series	cpufreq: schedutil: fixes for flags updates \| expand [0/6] cpufreq: schedutil: fixes for flags updates [1/6] cpufreq: schedutil: reset sg_cpus's flags at IDLE enter [2/6] cpufreq: schedutil: ignore the sugov kthread for frequencies selections [4/6] cpufreq: schedutil: relax rate-limiting while running RT/DL tasks

Patrick Bellasi March 2, 2017, 3:45 p.m. UTC

The current version of schedutil has some issues related to the management
of update flags used by systems with frequency domains spawning multiple CPUs.

Each time a CPU utilisation update is issued by the scheduler a set of flags
are configured to define (mainly) which class is asking for a utilisation
update. These flags are then used by the frequency selection policy to
identify the OPP to choose.

In the current implementation, CPU flags are overridden each time the
scheduler calls schedutil for an update. Such a behaviour produces issues
in these scenarios, where we assume CPU1 and CPU2 share the same frequency
domain:
a) a RT task which executed on CPU1 can keep the domain at an high frequency
   for a long period of time, even if there are no longer RT tasks on
   CPUs in that domain
b) a FAIR task co-scheduled in the same CPU of a RT task can override the
   flags configured by the RT task and potentially this can cause an
   unwanted frequency drop

These misbehaviours have been verified using a set of simple rt-app based
synthetic workloads, running on a ARM's Juno board, and results are
available in this Notebook [1].

This series proposes a set of fixes for the aforementioned issues as well
as a small improvement to speedup the selection of the maximum frequency
when RT tasks enter a CPU.

This series is based on top of today's tip/sched/core and is public available
from this repository:

  git://www.linux-arm.com/linux-pb eas/schedutil/flags_fixes

Cheers Patrick

[1] https://gist.github.com/d6a21b459a18091b2b058668a550010d

Patrick Bellasi (6):
  cpufreq: schedutil: reset sg_cpus's flags at IDLE enter
  cpufreq: schedutil: ignore the sugov kthread for frequencies
    selections
  cpufreq: schedutil: ensure max frequency while running RT/DL tasks
  cpufreq: schedutil: relax rate-limiting while running RT/DL tasks
  cpufreq: schedutil: avoid utilisation update when not necessary
  sched/rt: fast switch to maximum frequency when RT tasks are scheduled

 include/linux/sched.h            |  1 +
 kernel/sched/cpufreq_schedutil.c | 59 ++++++++++++++++++++++++++++++++++------
 kernel/sched/idle_task.c         |  4 +++
 kernel/sched/rt.c                | 15 ++++++++--
 4 files changed, 68 insertions(+), 11 deletions(-)

-- 
2.7.4

Vincent Guittot March 2, 2017, 4:09 p.m. UTC | #1

On 2 March 2017 at 16:45, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> The current version of schedutil has some issues related to the management

> of update flags used by systems with frequency domains spawning multiple CPUs.

>

> Each time a CPU utilisation update is issued by the scheduler a set of flags

> are configured to define (mainly) which class is asking for a utilisation

> update. These flags are then used by the frequency selection policy to

> identify the OPP to choose.

>

> In the current implementation, CPU flags are overridden each time the

> scheduler calls schedutil for an update. Such a behaviour produces issues

> in these scenarios, where we assume CPU1 and CPU2 share the same frequency

> domain:

> a) a RT task which executed on CPU1 can keep the domain at an high frequency

>    for a long period of time, even if there are no longer RT tasks on

>    CPUs in that domain


Normally this is dropped after a tick.
Nevertheless, there is an issue if the freq_update_delay_ns is shorter
than a tick because of sugov RT thread

> b) a FAIR task co-scheduled in the same CPU of a RT task can override the

>    flags configured by the RT task and potentially this can cause an

>    unwanted frequency drop

>

> These misbehaviours have been verified using a set of simple rt-app based

> synthetic workloads, running on a ARM's Juno board, and results are

> available in this Notebook [1].

>

> This series proposes a set of fixes for the aforementioned issues as well

> as a small improvement to speedup the selection of the maximum frequency

> when RT tasks enter a CPU.

>

> This series is based on top of today's tip/sched/core and is public available

> from this repository:

>

>   git://www.linux-arm.com/linux-pb eas/schedutil/flags_fixes

>

> Cheers Patrick

>

> [1] https://gist.github.com/d6a21b459a18091b2b058668a550010d

>

> Patrick Bellasi (6):

>   cpufreq: schedutil: reset sg_cpus's flags at IDLE enter

>   cpufreq: schedutil: ignore the sugov kthread for frequencies

>     selections

>   cpufreq: schedutil: ensure max frequency while running RT/DL tasks

>   cpufreq: schedutil: relax rate-limiting while running RT/DL tasks

>   cpufreq: schedutil: avoid utilisation update when not necessary

>   sched/rt: fast switch to maximum frequency when RT tasks are scheduled

>

>  include/linux/sched.h            |  1 +

>  kernel/sched/cpufreq_schedutil.c | 59 ++++++++++++++++++++++++++++++++++------

>  kernel/sched/idle_task.c         |  4 +++

>  kernel/sched/rt.c                | 15 ++++++++--

>  4 files changed, 68 insertions(+), 11 deletions(-)

>

> --

> 2.7.4

>

Viresh Kumar March 3, 2017, 8:31 a.m. UTC | #2

On 02-03-17, 15:45, Patrick Bellasi wrote:
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c

> @@ -293,15 +305,29 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,

>  	if (curr == sg_policy->thread)

>  		goto done;

>  

> +	/*

> +	 * While RT/DL tasks are running we do not want FAIR tasks to

> +	 * overwrite this CPU's flags, still we can update utilization and

> +	 * frequency (if required/possible) to be fair with these tasks.

> +	 */

> +	rt_mode = task_has_dl_policy(curr) ||

> +		  task_has_rt_policy(curr) ||

> +		  (flags & SCHED_CPUFREQ_RT_DL);

> +	if (rt_mode)

> +		sg_cpu->flags |= flags;

> +	else

> +		sg_cpu->flags = flags;


This looks so hacked up :)

Wouldn't it be better to let the scheduler tell us what all kind of tasks it has
in the rq of a CPU and pass a mask of flags? I think it wouldn't be difficult
(or time consuming) for the scheduler to know that, but I am not 100% sure.

IOW, the flags field in cpufreq_update_util() will represent all tasks in the
rq, instead of just the task that is getting enqueued/dequeued..

And obviously we need to get some utilization numbers for the RT and DL tasks
going forward, switching to max isn't going to work for ever :)

-- 
viresh

Rafael J. Wysocki March 15, 2017, 11:52 a.m. UTC | #3

On Friday, March 03, 2017 12:38:30 PM Patrick Bellasi wrote:
> On 03-Mar 14:01, Viresh Kumar wrote:

> > On 02-03-17, 15:45, Patrick Bellasi wrote:

> > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c

> > > @@ -293,15 +305,29 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,

> > >  	if (curr == sg_policy->thread)

> > >  		goto done;

> > >  

> > > +	/*

> > > +	 * While RT/DL tasks are running we do not want FAIR tasks to

> > > +	 * overwrite this CPU's flags, still we can update utilization and

> > > +	 * frequency (if required/possible) to be fair with these tasks.

> > > +	 */

> > > +	rt_mode = task_has_dl_policy(curr) ||

> > > +		  task_has_rt_policy(curr) ||

> > > +		  (flags & SCHED_CPUFREQ_RT_DL);

> > > +	if (rt_mode)

> > > +		sg_cpu->flags |= flags;

> > > +	else

> > > +		sg_cpu->flags = flags;

> > 

> > This looks so hacked up :)

> 

> It is... a bit... :)

> 

> > Wouldn't it be better to let the scheduler tell us what all kind of tasks it has

> > in the rq of a CPU and pass a mask of flags?

> 

> That would definitively report a more consistent view of what's going

> on on each CPU.

> 

> > I think it wouldn't be difficult (or time consuming) for the

> > scheduler to know that, but I am not 100% sure.

> 

> Main issue perhaps is that cpufreq_update_{util,this_cpu} are

> currently called by the scheduling classes codes and not from the core

> scheduler. However I agree that it should be possible to build up such

> information and make it available to the scheduling classes code.

> 

> I'll have a look at that.

> 

> > IOW, the flags field in cpufreq_update_util() will represent all tasks in the

> > rq, instead of just the task that is getting enqueued/dequeued..

> > 

> > And obviously we need to get some utilization numbers for the RT and DL tasks

> > going forward, switching to max isn't going to work for ever :)

> 

> Regarding this last point, there are WIP patches Juri is working on to

> feed DL demands to schedutil, his presentation at last ELC partially

> covers these developments:

>   https://www.youtube.com/watch?v=wzrcWNIneWY&index=37&list=PLbzoR-pLrL6pSlkQDW7RpnNLuxPq6WVUR

> 

> Instead, RT tasks are currently covered by an rt_avg metric which we

> already know is not fitting for most purposes.

> It seems that the main goal is twofold: move people to DL whenever

> possible otherwise live with the go-to-max policy which is the only

> sensible solution to satisfy the RT's class main goal, i.e. latency

> reduction.

> 

> Of course such a go-to-max policy for all RT tasks we already know

> that is going to destroy energy on many different mobile scenarios.

> 

> As a possible mitigation for that, while still being compliant with

> the main RT's class goal, we recently posted the SchedTune v3

> proposal:

>   https://lkml.org/lkml/2017/2/28/355

> 

> In that proposal, the simple usage of CGroups and the new capacity_max

> attribute of the (existing) CPU controller should allow to define what

> is the "max" value which is just enough to match the latency

> constraints of a mobile application without sacrificing too much

> energy.


And who's going to figure out what "max" value is most suitable?  And how?

Thanks,
Rafael

Patrick Bellasi March 15, 2017, 2:40 p.m. UTC | #4

On 15-Mar 12:52, Rafael J. Wysocki wrote:
> On Friday, March 03, 2017 12:38:30 PM Patrick Bellasi wrote:

> > On 03-Mar 14:01, Viresh Kumar wrote:

> > > On 02-03-17, 15:45, Patrick Bellasi wrote:

> > > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c

> > > > @@ -293,15 +305,29 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,

> > > >  	if (curr == sg_policy->thread)

> > > >  		goto done;

> > > >  

> > > > +	/*

> > > > +	 * While RT/DL tasks are running we do not want FAIR tasks to

> > > > +	 * overwrite this CPU's flags, still we can update utilization and

> > > > +	 * frequency (if required/possible) to be fair with these tasks.

> > > > +	 */

> > > > +	rt_mode = task_has_dl_policy(curr) ||

> > > > +		  task_has_rt_policy(curr) ||

> > > > +		  (flags & SCHED_CPUFREQ_RT_DL);

> > > > +	if (rt_mode)

> > > > +		sg_cpu->flags |= flags;

> > > > +	else

> > > > +		sg_cpu->flags = flags;

> > > 

> > > This looks so hacked up :)

> > 

> > It is... a bit... :)

> > 

> > > Wouldn't it be better to let the scheduler tell us what all kind of tasks it has

> > > in the rq of a CPU and pass a mask of flags?

> > 

> > That would definitively report a more consistent view of what's going

> > on on each CPU.

> > 

> > > I think it wouldn't be difficult (or time consuming) for the

> > > scheduler to know that, but I am not 100% sure.

> > 

> > Main issue perhaps is that cpufreq_update_{util,this_cpu} are

> > currently called by the scheduling classes codes and not from the core

> > scheduler. However I agree that it should be possible to build up such

> > information and make it available to the scheduling classes code.

> > 

> > I'll have a look at that.

> > 

> > > IOW, the flags field in cpufreq_update_util() will represent all tasks in the

> > > rq, instead of just the task that is getting enqueued/dequeued..

> > > 

> > > And obviously we need to get some utilization numbers for the RT and DL tasks

> > > going forward, switching to max isn't going to work for ever :)

> > 

> > Regarding this last point, there are WIP patches Juri is working on to

> > feed DL demands to schedutil, his presentation at last ELC partially

> > covers these developments:

> >   https://www.youtube.com/watch?v=wzrcWNIneWY&index=37&list=PLbzoR-pLrL6pSlkQDW7RpnNLuxPq6WVUR

> > 

> > Instead, RT tasks are currently covered by an rt_avg metric which we

> > already know is not fitting for most purposes.

> > It seems that the main goal is twofold: move people to DL whenever

> > possible otherwise live with the go-to-max policy which is the only

> > sensible solution to satisfy the RT's class main goal, i.e. latency

> > reduction.

> > 

> > Of course such a go-to-max policy for all RT tasks we already know

> > that is going to destroy energy on many different mobile scenarios.

> > 

> > As a possible mitigation for that, while still being compliant with

> > the main RT's class goal, we recently posted the SchedTune v3

> > proposal:

> >   https://lkml.org/lkml/2017/2/28/355

> > 

> > In that proposal, the simple usage of CGroups and the new capacity_max

> > attribute of the (existing) CPU controller should allow to define what

> > is the "max" value which is just enough to match the latency

> > constraints of a mobile application without sacrificing too much

> > energy.

Given the following interesting question, let's add Thomas Gleixner to
the discussion, which can be interested in sharing his viewpoint.

> And who's going to figure out what "max" value is most suitable?  And how?

That's usually up to the system profiling which is part of the
platform optimizations and tunings.
For example it's possible to run  experiments to measure the bandwidth
and (completion) latency requirements from the RT workloads.

It's something which people developing embedded/mobile systems are
quite aware of. I'm also quite confident on saying that most of
them can agree that just going to the max OPP, each and every time a
RT task becomes RUNNABLE, it is something which is more likely to hurt
than to give benefits.

AFAIK the current policy (i.e. "go to max") has been adopted for the
following main reasons, which I'm reporting with some observations.

.:: Missing of a suitable utilization metric for RT tasks

 There is actually a utilization signal (rq->rt_avg) but it has been
 verified to be "too slow" for the practical usage of driving OPP
 selection.
 Other possibilities are perhaps under exploration but they are not
 yet there.

.:: Promote the migration from RT to DEADLINE

 Which makes a lot of sens for many existing use-cases, starting from
 Android as well. However, it's also true that we cannot (at least yet)
 split the world in DEALINE vs FAIR.
 There is still, and there will be, a fair amount of RT tasks which
 just it makes sense to serve at best both from the performance as
 well as the power/energy standpoint.

.:: Because RT is all about "reducing latencies"

 Running at the maximum OPP is certainly the best way to aim for the
 minimum latencies but... RT is about doing things "in time", which
 does not imply "as fast as possible".
 There can be many different workloads where a lower OPP is just good
 enough to meet the expected soft RT behaviors provided by the Linux
 RT scheduler.

All that considered, the modifications proposed in this series,
combined with other bits which are for discussion in this [1] other
posting, can work together to provide a better and more tunable OPP
selection policy for RT tasks.

> Thanks,

> Rafael

Cheers Patrick

[1] https://lkml.org/lkml/2017/2/28/355

-- 
#include <best/regards.h>

Patrick Bellasi

Rafael J. Wysocki March 15, 2017, 11:32 p.m. UTC | #5

On Wed, Mar 15, 2017 at 3:40 PM, Patrick Bellasi
<patrick.bellasi@arm.com> wrote:
> On 15-Mar 12:52, Rafael J. Wysocki wrote:

>> On Friday, March 03, 2017 12:38:30 PM Patrick Bellasi wrote:

>> > On 03-Mar 14:01, Viresh Kumar wrote:

>> > > On 02-03-17, 15:45, Patrick Bellasi wrote:

>> > > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c

>> > > > @@ -293,15 +305,29 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,

>> > > >         if (curr == sg_policy->thread)

>> > > >                 goto done;

>> > > >

>> > > > +       /*

>> > > > +        * While RT/DL tasks are running we do not want FAIR tasks to

>> > > > +        * overwrite this CPU's flags, still we can update utilization and

>> > > > +        * frequency (if required/possible) to be fair with these tasks.

>> > > > +        */

>> > > > +       rt_mode = task_has_dl_policy(curr) ||

>> > > > +                 task_has_rt_policy(curr) ||

>> > > > +                 (flags & SCHED_CPUFREQ_RT_DL);

>> > > > +       if (rt_mode)

>> > > > +               sg_cpu->flags |= flags;

>> > > > +       else

>> > > > +               sg_cpu->flags = flags;

>> > >

>> > > This looks so hacked up :)

>> >

>> > It is... a bit... :)

>> >

>> > > Wouldn't it be better to let the scheduler tell us what all kind of tasks it has

>> > > in the rq of a CPU and pass a mask of flags?

>> >

>> > That would definitively report a more consistent view of what's going

>> > on on each CPU.

>> >

>> > > I think it wouldn't be difficult (or time consuming) for the

>> > > scheduler to know that, but I am not 100% sure.

>> >

>> > Main issue perhaps is that cpufreq_update_{util,this_cpu} are

>> > currently called by the scheduling classes codes and not from the core

>> > scheduler. However I agree that it should be possible to build up such

>> > information and make it available to the scheduling classes code.

>> >

>> > I'll have a look at that.

>> >

>> > > IOW, the flags field in cpufreq_update_util() will represent all tasks in the

>> > > rq, instead of just the task that is getting enqueued/dequeued..

>> > >

>> > > And obviously we need to get some utilization numbers for the RT and DL tasks

>> > > going forward, switching to max isn't going to work for ever :)

>> >

>> > Regarding this last point, there are WIP patches Juri is working on to

>> > feed DL demands to schedutil, his presentation at last ELC partially

>> > covers these developments:

>> >   https://www.youtube.com/watch?v=wzrcWNIneWY&index=37&list=PLbzoR-pLrL6pSlkQDW7RpnNLuxPq6WVUR

>> >

>> > Instead, RT tasks are currently covered by an rt_avg metric which we

>> > already know is not fitting for most purposes.

>> > It seems that the main goal is twofold: move people to DL whenever

>> > possible otherwise live with the go-to-max policy which is the only

>> > sensible solution to satisfy the RT's class main goal, i.e. latency

>> > reduction.

>> >

>> > Of course such a go-to-max policy for all RT tasks we already know

>> > that is going to destroy energy on many different mobile scenarios.

>> >

>> > As a possible mitigation for that, while still being compliant with

>> > the main RT's class goal, we recently posted the SchedTune v3

>> > proposal:

>> >   https://lkml.org/lkml/2017/2/28/355

>> >

>> > In that proposal, the simple usage of CGroups and the new capacity_max

>> > attribute of the (existing) CPU controller should allow to define what

>> > is the "max" value which is just enough to match the latency

>> > constraints of a mobile application without sacrificing too much

>> > energy.

>

> Given the following interesting question, let's add Thomas Gleixner to

> the discussion, which can be interested in sharing his viewpoint.

>

>> And who's going to figure out what "max" value is most suitable?  And how?

>

> That's usually up to the system profiling which is part of the

> platform optimizations and tunings.

> For example it's possible to run  experiments to measure the bandwidth

> and (completion) latency requirements from the RT workloads.

>

> It's something which people developing embedded/mobile systems are

> quite aware of.


Well, I was expecting an answer like this to be honest and let me say
that it is not too convincing from my perspective.

At least throwing embedded and mobile into one bag seems to be a
stretch, because the usage patterns of those two groups are quite
different, even though they may be similar from the hardware POV.

Mobile are mostly used as general-purpose computers nowadays (and I
guess we're essentially talking about anything running Android, not
just phones, aren't we?) with applications installed by users rather
than by system integrators, so I'm doubtful about the viability of the
"system integrators should take care of it" assumption in this
particular case.

> I'm also quite confident on saying that most of

> them can agree that just going to the max OPP, each and every time a

> RT task becomes RUNNABLE, it is something which is more likely to hurt

> than to give benefits.


It would be good to be able to estimate that likelihood somehow ...

> AFAIK the current policy (i.e. "go to max") has been adopted for the

> following main reasons, which I'm reporting with some observations.

>

>

> .:: Missing of a suitable utilization metric for RT tasks

>

>  There is actually a utilization signal (rq->rt_avg) but it has been

>  verified to be "too slow" for the practical usage of driving OPP

>  selection.

>  Other possibilities are perhaps under exploration but they are not

>  yet there.

>

>

> .:: Promote the migration from RT to DEADLINE

>

>  Which makes a lot of sens for many existing use-cases, starting from

>  Android as well. However, it's also true that we cannot (at least yet)

>  split the world in DEALINE vs FAIR.

>  There is still, and there will be, a fair amount of RT tasks which

>  just it makes sense to serve at best both from the performance as

>  well as the power/energy standpoint.

>

>

> .:: Because RT is all about "reducing latencies"

>

>  Running at the maximum OPP is certainly the best way to aim for the

>  minimum latencies but... RT is about doing things "in time", which

>  does not imply "as fast as possible".

>  There can be many different workloads where a lower OPP is just good

>  enough to meet the expected soft RT behaviors provided by the Linux

>  RT scheduler.


AFAICS, RT is about determinism and meeting deadlines is one aspect of that.

Essentially, the problem here is to know which OPP is sufficient to do
things "on time" and there simply is no good way to figure that out
for RT tasks, again, AFAICS.

And since we can't say which OPP is sufficient, we pretty much have no
choice but to use the top-most one.

> All that considered, the modifications proposed in this series,

> combined with other bits which are for discussion in this [1] other

> posting, can work together to provide a better and more tunable OPP

> selection policy for RT tasks.


OK, but "more tunable" need not imply "better".

Thanks,
Rafael

[0/6] cpufreq: schedutil: fixes for flags updates

Message

Comments