[RFC,0/8] SCHED_DEADLINE freq/cpu invariance and OPP selection


Message ID

20170523085351.18586-1-juri.lelli@arm.com

Headers

Received-SPF: pass (google.com: best guess record for domain of
	linux-kernel-owner@vger.kernel.org designates 209.132.180.67
	as permitted sender) client-ip=209.132.180.67; 
From: Juri Lelli <juri.lelli@arm.com>
To: peterz@infradead.org, mingo@redhat.com, rjw@rjwysocki.net,
	viresh.kumar@linaro.org
Cc: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org,
	tglx@linutronix.de, vincent.guittot@linaro.org,
	rostedt@goodmis.org, luca.abeni@santannapisa.it,
	claudio@evidence.eu.com, tommaso.cucinotta@santannapisa.it,
	bristot@redhat.com, mathieu.poirier@linaro.org, tkjos@android.com,
	joelaf@google.com, andresoportus@google.com,
	morten.rasmussen@arm.com, dietmar.eggemann@arm.com,
	patrick.bellasi@arm.com, juri.lelli@arm.com
Subject: [PATCH RFC 0/8] SCHED_DEADLINE freq/cpu invariance and OPP selection
Date: Tue, 23 May 2017 09:53:43 +0100
Message-Id: <20170523085351.18586-1-juri.lelli@arm.com>
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk

Series

SCHED_DEADLINE freq/cpu invariance and OPP selection

Message

Juri Lelli May 23, 2017, 8:53 a.m. UTC

Hi,

this RFC set implements frequency/cpu invariance and OPP selection for
SCHED_DEADLINE. The set has been lightly tested on a Juno platform. The
current incarnation of the patches stems both from previous RFD[1] review
comments and discussion at OSPM-summit[2], during which we seemed to agree
that:

 - we probably want to use running_bw (instead of this_bw), as it is less
   pessimistic (we should save more energy)
 - the special kworker hack seems acceptable as a mid-term solution to foster
   further SCHED_DEADLINE/schedutil development/adoption
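
As a rough illustration of why the choice of signal matters (a sketch only,
not code from the patches): if frequency is picked proportionally to a
bandwidth value, a smaller input directly yields a lower OPP. The helper name
freq_for_bw_khz(), the BW_SHIFT fixed-point unit and the kHz figures are
assumptions made for this example, which also assumes continuous frequency
steps and no margin.

```c
#include <assert.h>
#include <stdint.h>

/* Fixed-point unit for bandwidth in this sketch: 1 << 20 means 100% of a CPU. */
#define BW_SHIFT 20
#define BW_UNIT  (1ULL << BW_SHIFT)

/*
 * Hypothetical helper: lowest frequency (in kHz) able to serve bandwidth
 * 'bw', assuming continuous frequency steps and no safety margin.
 */
static uint64_t freq_for_bw_khz(uint64_t bw, uint64_t max_freq_khz)
{
	assert(bw <= BW_UNIT);
	return (max_freq_khz * bw) >> BW_SHIFT;
}
```

With a 1 GHz maximum, an input of 50% (say, this_bw) would ask for 500000 kHz,
while an input of 25% (say, running_bw) would ask for only 250000 kHz.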

A point that is still very much up for discussion (more than the others :) is
how we implement frequency/cpu scaling. SCHED_FLAG_RECLAIM tasks only need
grub_reclaim(), as the function already scales their reservation runtime
considering other reservations and maximum bandwidth a CPU has to offer.
However, for normal !RECLAIM tasks multiple things can be implemented which
seem to make sense:

 - don't scale at all: normal tasks will only get a % of CPU _time_ as granted
   by AC
 - go to max as soon as a normal task is enqueued: this because dimensioning of
   parameters is usually done at max OPP/biggest CPU and normal tasks assume
   that this is always the condition when they run
 - scale runtime according to current frequency and max CPU capacity: this is
   what this set is currently implementing
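
To make the third option concrete, here is a minimal user-space sketch of the
idea (not the code in patch 08/08): execution time is scaled by the current
frequency factor and by the CPU capacity factor before being charged against
the reservation runtime. The names scale_by_capacity() and
scaled_delta_exec(), and the 1024-based fixed-point unit, are assumptions made
for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Fixed-point unit for scale factors in this sketch: 1024 means 100%. */
#define CAPACITY_SHIFT 10
#define CAPACITY_SCALE (1ULL << CAPACITY_SHIFT)

/* Scale 'value' by a factor expressed in 1/1024 units. */
static uint64_t scale_by_capacity(uint64_t value, uint64_t factor)
{
	assert(factor <= CAPACITY_SCALE);
	return (value * factor) >> CAPACITY_SHIFT;
}

/*
 * Hypothetical helper: convert wall-clock execution time into the amount
 * charged against the reservation runtime. Running at a lower frequency
 * or on a smaller CPU consumes the budget more slowly.
 */
static uint64_t scaled_delta_exec(uint64_t delta_exec_ns,
				  uint64_t freq_scale, uint64_t cpu_scale)
{
	return scale_by_capacity(scale_by_capacity(delta_exec_ns, freq_scale),
				 cpu_scale);
}
```

Under these assumptions, 1 ms of execution at half frequency on the biggest
CPU (freq_scale = 512, cpu_scale = 1024) is charged as 0.5 ms of runtime.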

Opinions?

The set is based on tip/sched/core as of today (a9e7f6544b9c) plus some
schedutil fixes coming from linux-pm/linux-next and Luca's "CPU reclaiming for
SCHED_DEADLINE" [3].

Patches high level description:

 o [01-02]/08 add the necessary links to start accounting DEADLINE contribution
              to OPP selection
 o 03/08      a temporary solution that makes it possible (on ARM) to change
              frequency for DEADLINE tasks (which could otherwise delay the
              SCHED_FIFO worker kthread); the proper solution would be to be
              able to issue frequency transitions from an atomic ctx
 o [04-05]/08 schedutil changes that cope with the fact that DEADLINE
              doesn't require a periodic OPP selection triggering point
 o [06-07]/08 make the arch_scale_{freq,cpu}_capacity() functions available on
              !CONFIG_SMP configurations too
 o 08/08      implements frequency/cpu invariance for tasks' reservation
              parameters, which basically means that we implement GRUB-PA [4]
Changes w.r.t. RFD:

 - use grub_reclaim for RECLAIM and scale freq/cpu for !RECLAIM
 - discard CFS contribution only, after TICK_NSEC
 - added patches 06 and 07 to fix !CONFIG_SMP builds

Please have a look. Feedback and comments are, as usual, more than welcome.

In case you would like to test this out:

 git://linux-arm.org/linux-jl.git upstream/deadline/freq-rfc

Best,

- Juri

[1] http://marc.info/?l=linux-kernel&m=149036457909119&w=2
[2] http://retis.sssup.it/ospm-summit/program.html
    https://lwn.net/Articles/721573/
[3] http://marc.info/?l=linux-kernel&m=149513848804404
[4] C. Scordino, G. Lipari, A Resource Reservation Algorithm for Power-Aware
    Scheduling of Periodic and Aperiodic Real-Time Tasks, IEEE Transactions
    on Computers, December 2006.

Juri Lelli (8):
  sched/cpufreq_schedutil: make use of DEADLINE utilization signal
  sched/deadline: move cpu frequency selection triggering points
  sched/cpufreq_schedutil: make worker kthread be SCHED_DEADLINE
  sched/cpufreq_schedutil: split utilization signals
  sched/cpufreq_schedutil: always consider all CPUs when deciding next
    freq
  sched/sched.h: remove sd arch_scale_freq_capacity parameter
  sched/sched.h: move arch_scale_{freq,cpu}_capacity outside CONFIG_SMP
  sched/deadline: make bandwidth enforcement scale-invariant

 include/linux/sched.h            |  1 +
 include/linux/sched/cpufreq.h    |  2 --
 include/linux/sched/topology.h   | 12 ++++----
 include/uapi/linux/sched.h       |  1 +
 kernel/sched/core.c              | 19 ++++++++++--
 kernel/sched/cpufreq_schedutil.c | 62 ++++++++++++++++++++++++----------------
 kernel/sched/deadline.c          | 39 ++++++++++++++++++++-----
 kernel/sched/fair.c              |  4 +--
 kernel/sched/sched.h             | 27 +++++++++++++----
 9 files changed, 116 insertions(+), 51 deletions(-)

-- 
2.11.0

Comments

luca abeni May 24, 2017, 10:01 a.m. UTC | #1

On Wed, 24 May 2017 10:25:05 +0100
Juri Lelli <juri.lelli@arm.com> wrote:

> Hi,
>
> On 23/05/17 22:23, Peter Zijlstra wrote:
> > On Tue, May 23, 2017 at 09:53:43AM +0100, Juri Lelli wrote:
> >
> > > A point that is still very much up for discussion (more than the
> > > others :) is how we implement frequency/cpu scaling.
> > > SCHED_FLAG_RECLAIM tasks only need grub_reclaim(), as the
> > > function already scales their reservation runtime considering
> > > other reservations and maximum bandwidth a CPU has to offer.
> > > However, for normal !RECLAIM tasks multiple things can be
> > > implemented which seem to make sense:
> > >
> > >  - don't scale at all: normal tasks will only get a % of CPU
> > > _time_ as granted by AC
> > >  - go to max as soon as a normal task is enqueued: this because
> > > dimensioning of parameters is usually done at max OPP/biggest CPU
> > > and normal tasks assume that this is always the condition when
> > > they run
> > >  - scale runtime according to current frequency and max CPU
> > > capacity: this is what this set is currently implementing
> > >
> > > Opinions?
> >
> > So I'm terribly confused...
> >
> > By using the active bandwidth to select frequency we effectively
> > reduce idle time (to 0 if we had infinite granular frequency steps
> > and no margins).
> >
> > So !RECLAIM works as expected. They get the time they reserved,
> > since that was taken into account by active bandwidth.
>
> This was my impression as well, but Luca (and please Luca correct me
> if I misunderstood your point) argued (in an off-line discussion
> ahead of this posting) that !reclaim tasks might not be interested in
> reclaiming *at all*.


Well, I also admitted that I am almost completely ignorant about many
people's requirements...

What I know is that there are some people using SCHED_DEADLINE to make
sure that a task can make progress (executing with a "high priority")
without consuming more than a specified fraction of CPU time... So,
they for example schedule a CPU-hungry task with runtime=10ms and
period=100ms to make sure that the task can execute every 100ms (giving
the impression of a "fluid progress") without stealing more than 10% of
CPU time from other tasks.

In this case, if the CPU frequency changes the goal is still to
"reserve" 10% of CPU time (not more, even if the CPU is slower) to the
task. So, no runtime rescaling (or reclaiming) is required in this case.
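
To put numbers on this use-case (a sketch only; the helper names and the
1024-based frequency factor are assumptions made for this example): with
runtime=10ms and period=100ms, rescaling the runtime at half frequency would
let the task run for 20ms of wall-clock time per period, i.e. 20% of CPU time
instead of the 10% these users want to cap.

```c
#include <assert.h>
#include <stdint.h>

/* Fixed-point unit for the frequency factor in this sketch: 1024 means max. */
#define CAPACITY_SCALE 1024ULL

/*
 * Wall-clock time needed to consume 'runtime_ns' of budget dimensioned at
 * max frequency, when the CPU runs at freq_scale/1024 of its max speed.
 */
static uint64_t wall_clock_budget_ns(uint64_t runtime_ns, uint64_t freq_scale)
{
	assert(freq_scale > 0 && freq_scale <= CAPACITY_SCALE);
	return runtime_ns * CAPACITY_SCALE / freq_scale;
}

/* Share of CPU time, in percent, that 'budget_ns' takes out of one period. */
static uint64_t cpu_time_percent(uint64_t budget_ns, uint64_t period_ns)
{
	assert(period_ns > 0);
	return budget_ns * 100 / period_ns;
}
```

Without rescaling, the same task keeps its 10% share of CPU time at any
frequency, but completes proportionally less work when the CPU is slower.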


My proposal was that if a task is not interested in a fixed
runtime / fraction of CPU time but wants to adapt the runtime when the
CPU frequency scales, then it can select the RECLAIMING flag.

But of course there might be different requirements or other use-cases.



			Luca

> Since scaling frequency down is another way of
> effectively reclaiming unused bandwidth (the other being sharing
> unused bandwidth among reservations while keeping frequency at
> max), !reclaim tasks might not be interested in frequency scaling (my
> first point above) or might require frequency to be always at max
> (second point above).
>
> Does this help clarify things a bit? :)
>
> This said however, I'd personally be inclined to go with option 3
> above, which is what this set is currently implementing.
>
> > And RECLAIM works, since that only promises to (re)distribute idle
> > time, and if there is none that is an easy task.
>
> Right.
>
> Thanks,
>
> - Juri

Peter Zijlstra May 24, 2017, 11:29 a.m. UTC | #2

On Wed, May 24, 2017 at 12:01:51PM +0200, Luca Abeni wrote:
> > > So I'm terribly confused...
> > >
> > > By using the active bandwidth to select frequency we effectively
> > > reduce idle time (to 0 if we had infinite granular frequency steps
> > > and no margins).
> > >
> > > So !RECLAIM works as expected. They get the time they reserved,
> > > since that was taken into account by active bandwidth.

> Well, I also admitted that I am almost completely ignorant about many
> people's requirements...
>
> What I know is that there are some people using SCHED_DEADLINE to make
> sure that a task can make progress (executing with a "high priority")
> without consuming more than a specified fraction of CPU time... So,
> they for example schedule a CPU-hungry task with runtime=10ms and
> period=100ms to make sure that the task can execute every 100ms (giving
> the impression of a "fluid progress") without stealing more than 10% of
> CPU time from other tasks.
>
> In this case, if the CPU frequency changes the goal is still to
> "reserve" 10% of CPU time (not more, even if the CPU is slower) to the
> task. So, no runtime rescaling (or reclaiming) is required in this case.
>
> My proposal was that if a task is not interested in a fixed
> runtime / fraction of CPU time but wants to adapt the runtime when the
> CPU frequency scales, then it can select the RECLAIMING flag.


I think these people are doing it wrong :-)

Firstly, the runtime budget is a WCET. This very much means it is
subject to CPU frequency; after all, when the CPU runs slower, that same
amount of work takes longer. So being subject to cpufreq is the natural
state and should not require a special marker.

Secondly, if you want a steady progress of 10%, I don't see the problem
with giving them more at a slower frequency; they get the 'same' amount of
'work' done without bothering other people.

Peter Zijlstra May 24, 2017, 11:31 a.m. UTC | #3

On Wed, May 24, 2017 at 10:50:53AM +0100, Juri Lelli wrote:

> Agreed. However, problem seems to be that
>
>  - in my opinion (current implementation) this translates into scaling
>    runtime considering current freq and cpu-max-capacity; and this is
>    required when frequency scaling is enabled and we still want to meet
>    a task's guaranteed bandwidth

Just so. The bandwidth they request is based on instructions/work. We
need to get a certain amount of instructions sorted. Nobody cares that we
get an exact 10% at a random frequency if they lose their finger because we
didn't get that final instruction out that stops the saw blade.

>  - Luca seemed instead to be inclined to say that, if we scale runtime
>    for !reclaim tasks, such tasks are basically allowed to run for more
>    time (when frequency is lower than max) by using some of the
>    bandwidth not allocated to themselves

Yes, that's a wrong view :-) We don't care about 'time', we care about
getting the instruction stream / work completed.