Message ID | 17ff0b5d83a1275a98f0d1b87daf275f3e964af3.1513158452.git.viresh.kumar@linaro.org |
---|---|
State | New |
Series | sched: cpufreq: Track util update flags |
Hi Viresh,

On 13/12/17 15:23, Viresh Kumar wrote:
> Currently the schedutil governor overwrites the sg_cpu->flags field on every call to the utilization handler. It was pretty good as the initial implementation of utilization handlers, there are several drawbacks though.
>
> The biggest drawback is that the sg_cpu->flags field doesn't always represent the correct type of tasks that are enqueued on a CPU's rq. For example, if a fair task is enqueued while a RT or DL task is running, we will overwrite the flags with value 0 and that may take the CPU to lower OPPs unintentionally. There can be other corner cases as well which we aren't aware of currently.
>
> This patch changes the current implementation to keep track of all the task types that are currently enqueued to the CPUs rq. A new flag: CLEAR is introduced and is set by the scheduling classes when their last task is dequeued. When the CLEAR flag bit is set, the schedutil governor resets all the other flag bits that are present in the flags parameter. For now, the util update handlers return immediately if they were called to clear the flag.
>
> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
> ---
>  include/linux/sched/cpufreq.h    |  7 ++++++-
>  kernel/sched/cpufreq_schedutil.c | 21 ++++++++++++++++++---
>  kernel/sched/deadline.c          |  4 ++++
>  kernel/sched/fair.c              |  8 ++++++--
>  kernel/sched/rt.c                |  4 ++++
>  5 files changed, 38 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
> index d1ad3d825561..6f6641e61236 100644
> --- a/include/linux/sched/cpufreq.h
> +++ b/include/linux/sched/cpufreq.h
> @@ -8,10 +8,15 @@
>   * Interface between cpufreq drivers and the scheduler:
>   */
>
> +#define SCHED_CPUFREQ_CLEAR	(1U << 31)
>  #define SCHED_CPUFREQ_RT	(1U << 0)
>  #define SCHED_CPUFREQ_DL	(1U << 1)
> -#define SCHED_CPUFREQ_IOWAIT	(1U << 2)
> +#define SCHED_CPUFREQ_CFS	(1U << 2)

This flag doesn't do much, does it? I mean RT/DL/IOWAIT are used to bump up frequency, while you are adding CFS for the sake of symmetry, right? And with my patches DL will hopefully soon be in the same situation. If my understanding is correct, maybe add a comment?

Apart from this, patch looks good to me; couldn't test it though, sorry.

Reviewed-by: Juri Lelli <juri.lelli@redhat.com>

Thanks,

- Juri
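For reference, the comment Juri asks for might look like the sketch below. The flag values are the ones defined by the patch itself; the wording of the comment is only a suggestion and is not part of the posted series.

/*
 * Type of tasks enqueued on the CPU, as seen by the governors. RT/DL/IOWAIT
 * are used to bump the frequency; CFS is tracked only so that the set of
 * flags reflects every class with runnable tasks, and CLEAR tells the
 * governor to drop the bits passed along with it (last task dequeued).
 */
#define SCHED_CPUFREQ_CLEAR	(1U << 31)
#define SCHED_CPUFREQ_RT	(1U << 0)
#define SCHED_CPUFREQ_DL	(1U << 1)
#define SCHED_CPUFREQ_CFS	(1U << 2)
#define SCHED_CPUFREQ_IOWAIT	(1U << 3)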
On Wed, Dec 13, 2017 at 10:53 AM, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> Currently the schedutil governor overwrites the sg_cpu->flags field on every call to the utilization handler. It was pretty good as the initial implementation of utilization handlers, there are several drawbacks though.
>
> The biggest drawback is that the sg_cpu->flags field doesn't always represent the correct type of tasks that are enqueued on a CPU's rq. For example, if a fair task is enqueued while a RT or DL task is running, we will overwrite the flags with value 0 and that may take the CPU to lower OPPs unintentionally. There can be other corner cases as well which we aren't aware of currently.
>
> This patch changes the current implementation to keep track of all the task types that are currently enqueued to the CPUs rq. A new flag: CLEAR is introduced and is set by the scheduling classes when their last task is dequeued. When the CLEAR flag bit is set, the schedutil governor resets all the other flag bits that are present in the flags parameter. For now, the util update handlers return immediately if they were called to clear the flag.
>
> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
> ---
>  include/linux/sched/cpufreq.h    |  7 ++++++-
>  kernel/sched/cpufreq_schedutil.c | 21 ++++++++++++++++++---
>  kernel/sched/deadline.c          |  4 ++++
>  kernel/sched/fair.c              |  8 ++++++--
>  kernel/sched/rt.c                |  4 ++++
>  5 files changed, 38 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
> index d1ad3d825561..6f6641e61236 100644
> --- a/include/linux/sched/cpufreq.h
> +++ b/include/linux/sched/cpufreq.h
> @@ -8,10 +8,15 @@
>   * Interface between cpufreq drivers and the scheduler:
>   */
>
> +#define SCHED_CPUFREQ_CLEAR	(1U << 31)

I'm not thrilled by this, because schedutil is not the only user of the flags and it's totally unclear what the other user(s) should do when this is set.

Thanks,
Rafael
On 16 December 2017 at 22:10, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Wed, Dec 13, 2017 at 10:53 AM, Viresh Kumar <viresh.kumar@linaro.org> wrote:
>> +#define SCHED_CPUFREQ_CLEAR	(1U << 31)
>
> I'm not thrilled by this, because schedutil is not the only user of the flags and it's totally unclear what the other user(s) should do when this is set.

intel-pstate is the only other user of the IOWAIT flag, right? In order not to change the current behavior, we can update that to return early for now?

--
viresh
On Saturday, December 16, 2017 5:47:07 PM CET Viresh Kumar wrote:
> On 16 December 2017 at 22:10, Rafael J. Wysocki <rafael@kernel.org> wrote:
>> On Wed, Dec 13, 2017 at 10:53 AM, Viresh Kumar <viresh.kumar@linaro.org> wrote:
>>> +#define SCHED_CPUFREQ_CLEAR	(1U << 31)
>>
>> I'm not thrilled by this, because schedutil is not the only user of the flags and it's totally unclear what the other user(s) should do when this is set.
>
> intel-pstate is the only other user of the IOWAIT flag, right? In order not to change the current behavior, we can update that to return early for now?

We can do that in principle, but why should it return early? Maybe it's a good time to update things, incidentally?

I actually don't like the SCHED_CPUFREQ_CLEAR flag *concept* as it is very much specific to schedutil and blatantly ignores everybody else.

Alternatively, you could add two flags for clearing SCHED_CPUFREQ_RT and SCHED_CPUFREQ_DL that could just be ignored entirely by intel_pstate.

So, why don't you make SCHED_CPUFREQ_RT and SCHED_CPUFREQ_DL "sticky" until, say, SCHED_CPUFREQ_NO_RT and SCHED_CPUFREQ_NO_DL are passed, respectively?

Thanks,
Rafael
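A minimal sketch of the "sticky" alternative Rafael describes: the governor keeps the RT/DL bits set until a matching NO_* bit arrives, and a driver such as intel_pstate could simply ignore the NO_* bits. The NO_* bit positions and the helper name below are hypothetical, not taken from any posted patch.

#define SCHED_CPUFREQ_NO_RT	(1U << 3)	/* hypothetical: clears the sticky RT bit */
#define SCHED_CPUFREQ_NO_DL	(1U << 4)	/* hypothetical: clears the sticky DL bit */

/* Hypothetical helper inside schedutil: make RT/DL sticky until cleared. */
static void sugov_track_flags(struct sugov_cpu *sg_cpu, unsigned int flags)
{
	sg_cpu->flags |= flags & (SCHED_CPUFREQ_RT | SCHED_CPUFREQ_DL);

	if (flags & SCHED_CPUFREQ_NO_RT)
		sg_cpu->flags &= ~SCHED_CPUFREQ_RT;
	if (flags & SCHED_CPUFREQ_NO_DL)
		sg_cpu->flags &= ~SCHED_CPUFREQ_DL;
}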
On 17-12-17, 01:19, Rafael J. Wysocki wrote:
> We can do that in principle, but why should it return early? Maybe it's a good time to update things, incidentally?
>
> I actually don't like the SCHED_CPUFREQ_CLEAR flag *concept* as it is very much specific to schedutil and blatantly ignores everybody else.
>
> Alternatively, you could add two flags for clearing SCHED_CPUFREQ_RT and SCHED_CPUFREQ_DL that could just be ignored entirely by intel_pstate.
>
> So, why don't you make SCHED_CPUFREQ_RT and SCHED_CPUFREQ_DL "sticky" until, say, SCHED_CPUFREQ_NO_RT and SCHED_CPUFREQ_NO_DL are passed, respectively?

I didn't like adding scheduling class specific flags, and wanted the code to treat all of them in the same way. And then the governors can make a policy over that, on what to ignore and what not to. For example with the current patchset, the governors can know when nothing else is queued on a CPU and the CPU is going to get into the idle loop. They can choose to (or not to) do something in that case.

I just thought that writing consistent (i.e. no special code) code across all classes would be better.

--
viresh
On Mon, Dec 18, 2017 at 5:59 AM, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> On 17-12-17, 01:19, Rafael J. Wysocki wrote:
>> We can do that in principle, but why should it return early? Maybe it's a good time to update things, incidentally?
>>
>> I actually don't like the SCHED_CPUFREQ_CLEAR flag *concept* as it is very much specific to schedutil and blatantly ignores everybody else.
>>
>> Alternatively, you could add two flags for clearing SCHED_CPUFREQ_RT and SCHED_CPUFREQ_DL that could just be ignored entirely by intel_pstate.
>>
>> So, why don't you make SCHED_CPUFREQ_RT and SCHED_CPUFREQ_DL "sticky" until, say, SCHED_CPUFREQ_NO_RT and SCHED_CPUFREQ_NO_DL are passed, respectively?
>
> I didn't like adding scheduling class specific flags, and wanted the code to treat all of them in the same way. And then the governors can make a policy over that, on what to ignore and what not to. For example with the current patchset, the governors can know when nothing else is queued on a CPU and the CPU is going to get into the idle loop. They can choose to (or not to) do something in that case.

Well, if SCHED_CPUFREQ_CLEAR really means "this CPU is going to enter the idle loop", then it is better to call it SCHED_CPUFREQ_ENTER_IDLE, for example.

SCHED_CPUFREQ_CLEAR meaning basically "you should clear these flags now" doesn't seem to convey any information to whoever doesn't squirrel the flags in the first place.
On 18-Dec 17:29, Viresh Kumar wrote:
> On 18-12-17, 12:35, Rafael J. Wysocki wrote:
>> Well, if SCHED_CPUFREQ_CLEAR really means "this CPU is going to enter the idle loop", then it is better to call it SCHED_CPUFREQ_ENTER_IDLE, for example.
>>
>> SCHED_CPUFREQ_CLEAR meaning basically "you should clear these flags now" doesn't seem to convey any information to whoever doesn't squirrel the flags in the first place.
>
> Right, but when all the flags are cleared, then we can infer that we are going to idle in the most probable case.
>
> Anyway, I will implement RT and DL clear flags as you suggested in the next version.

I think Rafael is right, the current API is a bit odd since we cannot really set the CLEAR flag by itself. I mean, you can, but it will have no effect. Thus, it's probably better from an API standpoint to have a dedicated single clear flag... unfortunately it has to be one per class, which is still not optimal and will likely make the policy code a bit odd.

What about extending the signature of the callback method? For example, switching from:

-	void (*func)(struct update_util_data *data, u64 time,
-		     unsigned int flags))
+	void (*func)(struct update_util_data *data, u64 time,
+		     unsigned int flags, bool set))

Where the additional boolean is actually used to define which operation we wanna perform on the flags?

--
#include <best/regards.h>

Patrick Bellasi
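To make the proposal concrete, here is a rough sketch of how a schedutil callback could look with the extra parameter. The function name and the container_of() pattern follow the existing sugov_update_single() code; the body shown is illustrative only, not a posted implementation.

static void sugov_update_single(struct update_util_data *hook, u64 time,
				unsigned int flags, bool set)
{
	struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);

	if (set)
		sg_cpu->flags |= flags;		/* this class gained runnable tasks */
	else
		sg_cpu->flags &= ~flags;	/* its last task was dequeued */

	/* ... frequency selection would then look at sg_cpu->flags as before ... */
}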
On Mon, Dec 18, 2017 at 12:59 PM, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> On 18-12-17, 12:35, Rafael J. Wysocki wrote:
>> Well, if SCHED_CPUFREQ_CLEAR really means "this CPU is going to enter the idle loop", then it is better to call it SCHED_CPUFREQ_ENTER_IDLE, for example.
>>
>> SCHED_CPUFREQ_CLEAR meaning basically "you should clear these flags now" doesn't seem to convey any information to whoever doesn't squirrel the flags in the first place.
>
> Right, but when all the flags are cleared, then we can infer that we are going to idle in the most probable case.
>
> Anyway, I will implement RT and DL clear flags as you suggested in the next version.

Cool, thanks!
On 18-12-17, 12:14, Patrick Bellasi wrote:
> For example, switching from:
>
> -	void (*func)(struct update_util_data *data, u64 time,
> -		     unsigned int flags))
> +	void (*func)(struct update_util_data *data, u64 time,
> +		     unsigned int flags, bool set))
>
> Where the additional boolean is actually used to define which operation we wanna perform on the flags?

The code will eventually have the same complexity or ugliness in both the cases. I would like to start with another flag for now and see if people prefer another parameter.

--
viresh
Hi Viresh,

On Mon, Dec 18, 2017 at 7:12 PM, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> On 18-12-17, 12:14, Patrick Bellasi wrote:
>> For example, switching from:
>>
>> -	void (*func)(struct update_util_data *data, u64 time,
>> -		     unsigned int flags))
>> +	void (*func)(struct update_util_data *data, u64 time,
>> +		     unsigned int flags, bool set))
>>
>> Where the additional boolean is actually used to define which operation we wanna perform on the flags?
>
> The code will eventually have the same complexity or ugliness in both the cases. I would like to start with another flag for now and see if people prefer another parameter.

Though I think that will solve Rafael's concern of polluting the flags for something schedutil specific. I also feel adding an extra callback parameter is cleaner than 2 new clear flags.

Thanks,

- Joel
On 19-12-17, 08:52, Viresh Kumar wrote:
> On 18-12-17, 19:18, Joel Fernandes wrote:
>> Hi Viresh,
>>
>> On Mon, Dec 18, 2017 at 7:12 PM, Viresh Kumar <viresh.kumar@linaro.org> wrote:
>>> On 18-12-17, 12:14, Patrick Bellasi wrote:
>>>> For example, switching from:
>>>>
>>>> -	void (*func)(struct update_util_data *data, u64 time,
>>>> -		     unsigned int flags))
>>>> +	void (*func)(struct update_util_data *data, u64 time,
>>>> +		     unsigned int flags, bool set))
>>>>
>>>> Where the additional boolean is actually used to define which operation we wanna perform on the flags?
>>>
>>> The code will eventually have the same complexity or ugliness in both the cases. I would like to start with another flag for now and see if people prefer another parameter.
>>
>> Though I think that will solve Rafael's concern of polluting the flags for something schedutil specific. I also feel adding an extra callback parameter is cleaner than 2 new clear flags.
>
> Okay, I will then wait for Rafael to come online and comment on what he would prefer before posting.

I thought about it once more. If we decide eventually to add another parameter, then why isn't the approach that this patch takes better than that? i.e. Use the 31st bit of flags for the clear bit? We can remove setting/clearing flags for CFS, that's it.

--
viresh
On Mon, Dec 18, 2017 at 7:26 PM, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> On 19-12-17, 08:52, Viresh Kumar wrote:
>> On 18-12-17, 19:18, Joel Fernandes wrote:
>>> Hi Viresh,
>>>
>>> On Mon, Dec 18, 2017 at 7:12 PM, Viresh Kumar <viresh.kumar@linaro.org> wrote:
>>>> On 18-12-17, 12:14, Patrick Bellasi wrote:
>>>>> For example, switching from:
>>>>>
>>>>> -	void (*func)(struct update_util_data *data, u64 time,
>>>>> -		     unsigned int flags))
>>>>> +	void (*func)(struct update_util_data *data, u64 time,
>>>>> +		     unsigned int flags, bool set))
>>>>>
>>>>> Where the additional boolean is actually used to define which operation we wanna perform on the flags?
>>>>
>>>> The code will eventually have the same complexity or ugliness in both the cases. I would like to start with another flag for now and see if people prefer another parameter.
>>>
>>> Though I think that will solve Rafael's concern of polluting the flags for something schedutil specific. I also feel adding an extra callback parameter is cleaner than 2 new clear flags.
>>
>> Okay, I will then wait for Rafael to come online and comment on what he would prefer before posting.
>
> I thought about it once more. If we decide eventually to add another parameter, then why isn't the approach that this patch takes better than that? i.e. Use the 31st bit of flags for the clear bit? We can remove setting/clearing flags for CFS, that's it.

Yes that's clean to me but then as Rafael said, the use of this flag will be too specific for schedutil-only sg_cpu->flags clearing purpose, right? If adding the single flag is OK in the grand cpufreq scheme of things, then that's fine with me.

Thanks,

- Joel
On 18-12-17, 19:30, Joel Fernandes wrote:
> Yes that's clean to me but then as Rafael said, the use of this flag will be too specific for schedutil-only sg_cpu->flags clearing purpose, right?

And so would be the extra parameter?

--
viresh
On Wed, Dec 13, 2017 at 03:23:21PM +0530, Viresh Kumar wrote:
> Currently the schedutil governor overwrites the sg_cpu->flags field on every call to the utilization handler. It was pretty good as the initial implementation of utilization handlers, there are several drawbacks though.
>
> The biggest drawback is that the sg_cpu->flags field doesn't always represent the correct type of tasks that are enqueued on a CPU's rq. For example, if a fair task is enqueued while a RT or DL task is running, we will overwrite the flags with value 0 and that may take the CPU to lower OPPs unintentionally. There can be other corner cases as well which we aren't aware of currently.
>
> This patch changes the current implementation to keep track of all the task types that are currently enqueued to the CPUs rq. A new flag: CLEAR is introduced and is set by the scheduling classes when their last task is dequeued. When the CLEAR flag bit is set, the schedutil governor resets all the other flag bits that are present in the flags parameter. For now, the util update handlers return immediately if they were called to clear the flag.

Yeah, not happy about this either; we had code that did the right thing without this extra tracking I think.

Also, we can look at the rq state if we want to, we don't need to duplicate that state.
On 19-12-17, 20:25, Peter Zijlstra wrote:
> Yeah, not happy about this either; we had code that did the right thing without this extra tracking I think.

Sure, but how do you suggest we fix the problems we are facing with the current design? Patrick had a completely different proposal for solving those problems, which I didn't like very much. This patchset replaced these patches from Patrick:

- [PATCH v3 1/6] cpufreq: schedutil: reset sg_cpus's flags at IDLE enter
  https://marc.info/?l=linux-kernel&m=151204247801633&w=2

- [PATCH v3 2/6] cpufreq: schedutil: ensure max frequency while running RT/DL tasks
  https://marc.info/?l=linux-kernel&m=151204253801657&w=2

- [PATCH v3 6/6] cpufreq: schedutil: ignore sugov kthreads
  https://marc.info/?l=linux-kernel&m=151204251501647&w=2

> Also, we can look at the rq state if we want to, we don't need to duplicate that state.

Well that also looks fine to me, and that would mean this:

- We remove the SCHED_CPUFREQ_RT and SCHED_CPUFREQ_DL flags, but still call the utilization callbacks from the RT and DL classes.

- From the utilization handler, we check the runqueues of all three sched classes to see if they have some work pending (this can be done smartly by checking only RT first and skipping the other checks if RT has some work).

Will that be acceptable?

--
viresh
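As a rough illustration of the second bullet, a flag-less governor could derive the same information from the run-queue counters that this very patch checks in the dequeue paths (rq->rt.rt_nr_running, rq->dl.dl_nr_running, rq->cfs.nr_running). The helper below is hypothetical and glosses over the locking/ordering questions a real implementation would have to answer.

/* Hypothetical: would the rq state alone justify running at max frequency? */
static bool sugov_rq_needs_max(struct rq *rq)
{
	/* Check RT first, as suggested, so the common case exits early. */
	if (rq->rt.rt_nr_running)
		return true;

	return rq->dl.dl_nr_running != 0;
}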
On Wed, Dec 20, 2017 at 09:34:46AM +0530, Viresh Kumar wrote:
> On 19-12-17, 20:25, Peter Zijlstra wrote:
>> Yeah, not happy about this either; we had code that did the right thing without this extra tracking I think.
>
> Sure, but how do you suggest we fix the problems we are facing with the current design? Patrick had a completely different proposal for solving those problems, which I didn't like very much. This patchset replaced these patches from Patrick:

Please use the normal link format:

  https://lkml.kernel.org/r/$MSGID

Then I can find them without having to resort to a frigging browser thing.

I'll try and dig through the email I have.

>> Also, we can look at the rq state if we want to, we don't need to duplicate that state.
>
> Well that also looks fine to me, and that would mean this:
>
> - We remove the SCHED_CPUFREQ_RT and SCHED_CPUFREQ_DL flags, but still call the utilization callbacks from the RT and DL classes.

Didn't juri have patches to make DL do something sane? But yes, I think those flags are part of the problem.

> - From the utilization handler, we check the runqueues of all three sched classes to see if they have some work pending (this can be done smartly by checking only RT first and skipping the other checks if RT has some work).

No that's wrong. DL should provide a minimum required based on existing reservations, we can add the expected CFS average on top and request that.

And for RT all we need to know is if current is of that class, otherwise we don't care.
On 20-12-17, 09:31, Peter Zijlstra wrote:
> On Wed, Dec 20, 2017 at 09:34:46AM +0530, Viresh Kumar wrote:
> Please use the normal link format:
>
>   https://lkml.kernel.org/r/$MSGID
>
> Then I can find them without having to resort to a frigging browser thing.

Sure, and that would be much easier for me as well, as that's how I got to those links. Here they are again:

https://lkml.kernel.org/r/20171130114723.29210-2-patrick.bellasi@arm.com
https://lkml.kernel.org/r/20171130114723.29210-3-patrick.bellasi@arm.com
https://lkml.kernel.org/r/20171130114723.29210-7-patrick.bellasi@arm.com

> I'll try and dig through the email I have.

Thanks.

>> Well that also looks fine to me, and that would mean this:
>>
>> - We remove the SCHED_CPUFREQ_RT and SCHED_CPUFREQ_DL flags, but still call the utilization callbacks from the RT and DL classes.
>
> Didn't juri have patches to make DL do something sane? But yes, I think those flags are part of the problem.

Sure, DL will be more like CFS going forward. I was just commenting based on what we have upstream today.

>> - From the utilization handler, we check the runqueues of all three sched classes to see if they have some work pending (this can be done smartly by checking only RT first and skipping the other checks if RT has some work).
>
> No that's wrong. DL should provide a minimum required based on existing reservations, we can add the expected CFS average on top and request that.

Right, that should be the case after Juri's patches.

> And for RT all we need to know is if current is of that class, otherwise we don't care.

What about this case: a CFS task is running currently and an RT task is enqueued.

- Is it always the case that the CFS task is preempted immediately and the CPU is given to the RT task? I was talking to Vincent earlier and he told me that for certain configurations the CFS task may keep running until the next tick.

- What if the CFS task has disabled preemption?

- More corner cases like this?

The above cases may not let schedutil raise the frequency to MAX even when we have RT stuff enqueued. And that's why I tried to track all sched classes for which we have work enqueued. There are great chances that my understanding here is wrong though :)

--
viresh
On Wed, Dec 20, 2017 at 02:18:59PM +0530, Viresh Kumar wrote:
> On 20-12-17, 09:31, Peter Zijlstra wrote:
> What about this case: a CFS task is running currently and an RT task is enqueued.
>
> - Is it always the case that the CFS task is preempted immediately and the CPU is given to the RT task? I was talking to Vincent earlier and he told me that for certain configurations the CFS task may keep running until the next tick.
>
> - What if the CFS task has disabled preemption?
>
> - More corner cases like this?

If there is a runnable RT task we'll schedule to it ASAP. This means for:

  CONFIG_PREEMPT=y	either immediately or when the next preempt_enable() hits a 0 preempt_count.

  CONFIG_PREEMPT=n	on any next kernel preemption point or when returning to userspace.

But this is not something we should worry about.
On 20-Dec 09:31, Peter Zijlstra wrote:
> On Wed, Dec 20, 2017 at 09:34:46AM +0530, Viresh Kumar wrote:
>> On 19-12-17, 20:25, Peter Zijlstra wrote:
>>> Yeah, not happy about this either; we had code that did the right thing without this extra tracking I think.
>>
>> Sure, but how do you suggest we fix the problems we are facing with the current design? Patrick had a completely different proposal for solving those problems, which I didn't like very much. This patchset replaced these patches from Patrick:
>
> Please use the normal link format:
>
>   https://lkml.kernel.org/r/$MSGID
>
> Then I can find them without having to resort to a frigging browser thing.
>
> I'll try and dig through the email I have.
>
>>> Also, we can look at the rq state if we want to, we don't need to duplicate that state.
>>
>> Well that also looks fine to me, and that would mean this:
>>
>> - We remove the SCHED_CPUFREQ_RT and SCHED_CPUFREQ_DL flags, but still call the utilization callbacks from the RT and DL classes.
>
> Didn't juri have patches to make DL do something sane? But yes, I think those flags are part of the problem.

He recently reposted them here:

  https://lkml.kernel.org/r/20171204102325.5110-1-juri.lelli@redhat.com

>> - From the utilization handler, we check the runqueues of all three sched classes to see if they have some work pending (this can be done smartly by checking only RT first and skipping the other checks if RT has some work).
>
> No that's wrong. DL should provide a minimum required based on existing reservations, we can add the expected CFS average on top and request that.
>
> And for RT all we need to know is if current is of that class, otherwise we don't care.

So, this:

  https://marc.info/?i=20171130114723.29210-3-patrick.bellasi%40arm.com

was actually going in this direction, although still working on top of flags to not change the existing interface too much.

IMO, the advantage of flags is that they are a sort-of "pro-active" approach, where the scheduler notifies sensible events to schedutil. But to keep adding flags seems like overkill to me too.

If we remove flags then we have to query the scheduler classes "on demand"... but, as Peter suggests, once we have the DL bits Juri posted, the only issue is to know if an RT task is running. Then the patch above can be just good enough, with no flags at all and with just a check for current being RT (or DL for the time being).

--
#include <best/regards.h>

Patrick Bellasi
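The "just a check for current being RT (or DL)" idea could be as small as the sketch below. The helper name is made up; rt_task()/dl_task() are used here only to illustrate testing the class of the task the CPU is running, and a real implementation would have to decide where and under which locks this check is done.

/* Hypothetical check: is the task currently on this CPU an RT or DL one? */
static bool sugov_cpu_runs_rt_dl(struct rq *rq)
{
	struct task_struct *curr = READ_ONCE(rq->curr);

	return rt_task(curr) || dl_task(curr);
}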
On Wed, Dec 20, 2017 at 12:55:46PM +0000, Patrick Bellasi wrote:
> On 20-Dec 09:31, Peter Zijlstra wrote:
>> Didn't juri have patches to make DL do something sane? But yes, I think those flags are part of the problem.
>
> He recently reposted them here:
>
>   https://lkml.kernel.org/r/20171204102325.5110-1-juri.lelli@redhat.com

Yeah, just found them and actually munged them into my queue; did all the modifications you suggested too. Lets see if it comes apart.

>>> - From the utilization handler, we check the runqueues of all three sched classes to see if they have some work pending (this can be done smartly by checking only RT first and skipping the other checks if RT has some work).
>>
>> No that's wrong. DL should provide a minimum required based on existing reservations, we can add the expected CFS average on top and request that.
>>
>> And for RT all we need to know is if current is of that class, otherwise we don't care.
>
> So, this:
>
>   https://marc.info/?i=20171130114723.29210-3-patrick.bellasi%40arm.com

Right, I was actually looking for those patches, but I'm searching backwards and hit upon Juri's patches first.

> was actually going in this direction, although still working on top of flags to not change the existing interface too much.
>
> IMO, the advantage of flags is that they are a sort-of "pro-active" approach, where the scheduler notifies sensible events to schedutil. But to keep adding flags seems like overkill to me too.
>
> If we remove flags then we have to query the scheduler classes "on demand"... but, as Peter suggests, once we have the DL bits Juri posted, the only issue is to know if an RT task is running. Then the patch above can be just good enough, with no flags at all and with just a check for current being RT (or DL for the time being).

Well, we still need flags for crap like IO-WAIT IIRC. That's sugov internal state and not something the scheduler actually already knows.

But let me continue searching for patches.. Ooh, I found patches from Brendan... should be very close to yours though, going by that msgid you posted on Nov 30th and I'm now on Dec 1st, soooon... :-)
On Wed, Dec 20, 2017 at 03:47:23PM +0100, Rafael J. Wysocki wrote:
>> Well, we still need flags for crap like IO-WAIT IIRC. That's sugov internal state and not something the scheduler actually already knows.
>
> Not only sugov to be precise, but yes.

Oh, right, intel_pstate also consumes that one..
On Wednesday, December 20, 2017 3:31:00 PM CET Patrick Bellasi wrote:
> On 20-Dec 14:28, Peter Zijlstra wrote:
>> On Wed, Dec 20, 2017 at 12:55:46PM +0000, Patrick Bellasi wrote:
>>> On 20-Dec 09:31, Peter Zijlstra wrote:
>>>> Didn't juri have patches to make DL do something sane? But yes, I think those flags are part of the problem.
>>>
>>> He recently reposted them here:
>>>
>>>   https://lkml.kernel.org/r/20171204102325.5110-1-juri.lelli@redhat.com
>>
>> Yeah, just found them and actually munged them into my queue; did all the modifications you suggested too. Lets see if it comes apart.
>>
>>>>> - From the utilization handler, we check the runqueues of all three sched classes to see if they have some work pending (this can be done smartly by checking only RT first and skipping the other checks if RT has some work).
>>>>
>>>> No that's wrong. DL should provide a minimum required based on existing reservations, we can add the expected CFS average on top and request that.
>>>>
>>>> And for RT all we need to know is if current is of that class, otherwise we don't care.
>>>
>>> So, this:
>>>
>>>   https://marc.info/?i=20171130114723.29210-3-patrick.bellasi%40arm.com
>>
>> Right, I was actually looking for those patches, but I'm searching backwards and hit upon Juri's patches first.
>>
>>> was actually going in this direction, although still working on top of flags to not change the existing interface too much.
>>>
>>> IMO, the advantage of flags is that they are a sort-of "pro-active" approach, where the scheduler notifies sensible events to schedutil. But to keep adding flags seems like overkill to me too.
>>>
>>> If we remove flags then we have to query the scheduler classes "on demand"... but, as Peter suggests, once we have the DL bits Juri posted, the only issue is to know if an RT task is running. Then the patch above can be just good enough, with no flags at all and with just a check for current being RT (or DL for the time being).
>>
>> Well, we still need flags for crap like IO-WAIT IIRC. That's sugov internal state and not something the scheduler actually already knows.
>
> Right, that flag is set from:
>
>   core.c::io_schedule_prepare()
>
> for the current task, which is going to be dequeued soon.
>
> Once it wakes up the next time, at enqueue time we trigger a boosting by passing schedutil that flag.
>
> Thus, unless we are happy to delay the boosting until the task is actually picked for execution (don't think so), then we need to keep the flag and signal schedutil at enqueue time.
>
> However, I was wondering one thing: shouldn't we already have a vruntime bonus for IO sleeping tasks? Because in that case, the task is likely to be on the CPU quite soon... and thus, perhaps removing the flag and moving the schedutil notification into core.c at the end of __schedule() should work to detect both RT and FAIR::IOWAIT boosted tasks.

schedutil is not the only user of this flag.

Thanks,
Rafael
On 20-Dec 15:52, Rafael J. Wysocki wrote:
> On Wednesday, December 20, 2017 3:31:00 PM CET Patrick Bellasi wrote:
>> On 20-Dec 14:28, Peter Zijlstra wrote:
>>> On Wed, Dec 20, 2017 at 12:55:46PM +0000, Patrick Bellasi wrote:
>>>> On 20-Dec 09:31, Peter Zijlstra wrote:
>>>>> Didn't juri have patches to make DL do something sane? But yes, I think those flags are part of the problem.
>>>>
>>>> He recently reposted them here:
>>>>
>>>>   https://lkml.kernel.org/r/20171204102325.5110-1-juri.lelli@redhat.com
>>>
>>> Yeah, just found them and actually munged them into my queue; did all the modifications you suggested too. Lets see if it comes apart.
>>>
>>>>>> - From the utilization handler, we check the runqueues of all three sched classes to see if they have some work pending (this can be done smartly by checking only RT first and skipping the other checks if RT has some work).
>>>>>
>>>>> No that's wrong. DL should provide a minimum required based on existing reservations, we can add the expected CFS average on top and request that.
>>>>>
>>>>> And for RT all we need to know is if current is of that class, otherwise we don't care.
>>>>
>>>> So, this:
>>>>
>>>>   https://marc.info/?i=20171130114723.29210-3-patrick.bellasi%40arm.com
>>>
>>> Right, I was actually looking for those patches, but I'm searching backwards and hit upon Juri's patches first.
>>>
>>>> was actually going in this direction, although still working on top of flags to not change the existing interface too much.
>>>>
>>>> IMO, the advantage of flags is that they are a sort-of "pro-active" approach, where the scheduler notifies sensible events to schedutil. But to keep adding flags seems like overkill to me too.
>>>>
>>>> If we remove flags then we have to query the scheduler classes "on demand"... but, as Peter suggests, once we have the DL bits Juri posted, the only issue is to know if an RT task is running. Then the patch above can be just good enough, with no flags at all and with just a check for current being RT (or DL for the time being).
>>>
>>> Well, we still need flags for crap like IO-WAIT IIRC. That's sugov internal state and not something the scheduler actually already knows.
>>
>> Right, that flag is set from:
>>
>>   core.c::io_schedule_prepare()
>>
>> for the current task, which is going to be dequeued soon.
>>
>> Once it wakes up the next time, at enqueue time we trigger a boosting by passing schedutil that flag.
>>
>> Thus, unless we are happy to delay the boosting until the task is actually picked for execution (don't think so), then we need to keep the flag and signal schedutil at enqueue time.
>>
>> However, I was wondering one thing: shouldn't we already have a vruntime bonus for IO sleeping tasks? Because in that case, the task is likely to be on the CPU quite soon... and thus, perhaps removing the flag and moving the schedutil notification into core.c at the end of __schedule() should work to detect both RT and FAIR::IOWAIT boosted tasks.
>
> schedutil is not the only user of this flag.

Sure, but with the idea above (not completely sure it makes sense) intel_pstate_update_util() can still get the IOWAIT information. We just get that info from current->in_iowait instead of checking a flag which is passed in via callback.

> Thanks,
> Rafael

--
#include <best/regards.h>

Patrick Bellasi
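A sketch of what Patrick suggests here: instead of carrying an IOWAIT flag through the callback, the scheduler core would notify cpufreq once the next task has been picked, and the consumer (schedutil or intel_pstate) would look at the task itself. task_struct::in_iowait is an existing field; the helper below is hypothetical and glosses over the fact that today the boost is requested at enqueue time, before the task is actually current.

/*
 * Hypothetical shape of the idea: notify cpufreq from the scheduler core
 * after pick, deriving the io-wait information from the task itself.
 */
static void cpufreq_note_next_task(struct rq *rq, struct task_struct *next)
{
	unsigned int flags = 0;

	if (next->in_iowait)
		flags |= SCHED_CPUFREQ_IOWAIT;	/* or drop the flag entirely */

	cpufreq_update_util(rq, flags);
}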
On Wed, Dec 20, 2017 at 06:27:17PM +0100, Juri Lelli wrote:
> Thanks Peter for taking the patches. I was actually waiting for the flag thing to be resolved to post again. :/

Yeah, I took them because it made sorting that easier, n/p.
diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
index d1ad3d825561..6f6641e61236 100644
--- a/include/linux/sched/cpufreq.h
+++ b/include/linux/sched/cpufreq.h
@@ -8,10 +8,15 @@
  * Interface between cpufreq drivers and the scheduler:
  */
 
+#define SCHED_CPUFREQ_CLEAR	(1U << 31)
 #define SCHED_CPUFREQ_RT	(1U << 0)
 #define SCHED_CPUFREQ_DL	(1U << 1)
-#define SCHED_CPUFREQ_IOWAIT	(1U << 2)
+#define SCHED_CPUFREQ_CFS	(1U << 2)
+#define SCHED_CPUFREQ_IOWAIT	(1U << 3)
+#define SCHED_CPUFREQ_RT_CLEAR	(SCHED_CPUFREQ_RT | SCHED_CPUFREQ_CLEAR)
+#define SCHED_CPUFREQ_DL_CLEAR	(SCHED_CPUFREQ_DL | SCHED_CPUFREQ_CLEAR)
+#define SCHED_CPUFREQ_CFS_CLEAR	(SCHED_CPUFREQ_CFS | SCHED_CPUFREQ_CLEAR)
 #define SCHED_CPUFREQ_RT_DL	(SCHED_CPUFREQ_RT | SCHED_CPUFREQ_DL)
 
 #ifdef CONFIG_CPU_FREQ
 struct update_util_data {
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index e8ccfa30f01a..60a2dea4c8cc 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -191,6 +191,8 @@ static void sugov_set_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
 			       unsigned int flags)
 {
 	if (flags & SCHED_CPUFREQ_IOWAIT) {
+		sg_cpu->flags &= ~SCHED_CPUFREQ_IOWAIT;
+
 		if (sg_cpu->iowait_boost_pending)
 			return;
 
@@ -264,6 +266,13 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
 	unsigned int next_f;
 	bool busy;
 
+	if (unlikely(flags & SCHED_CPUFREQ_CLEAR)) {
+		sg_cpu->flags &= ~flags;
+		return;
+	}
+
+	sg_cpu->flags |= flags;
+
 	sugov_set_iowait_boost(sg_cpu, time, flags);
 	sg_cpu->last_update = time;
 
@@ -272,7 +281,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
 
 	busy = sugov_cpu_is_busy(sg_cpu);
 
-	if (flags & SCHED_CPUFREQ_RT_DL) {
+	if (sg_cpu->flags & SCHED_CPUFREQ_RT_DL) {
 		next_f = policy->cpuinfo.max_freq;
 	} else {
 		sugov_get_util(&util, &max, sg_cpu->cpu);
@@ -345,15 +354,20 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
 
 	raw_spin_lock(&sg_policy->update_lock);
 
+	if (unlikely(flags & SCHED_CPUFREQ_CLEAR)) {
+		sg_cpu->flags &= ~flags;
+		goto unlock;
+	}
+
 	sg_cpu->util = util;
 	sg_cpu->max = max;
-	sg_cpu->flags = flags;
+	sg_cpu->flags |= flags;
 
 	sugov_set_iowait_boost(sg_cpu, time, flags);
 	sg_cpu->last_update = time;
 
 	if (sugov_should_update_freq(sg_policy, time)) {
-		if (flags & SCHED_CPUFREQ_RT_DL)
+		if (sg_cpu->flags & SCHED_CPUFREQ_RT_DL)
 			next_f = sg_policy->policy->cpuinfo.max_freq;
 		else
 			next_f = sugov_next_freq_shared(sg_cpu, time);
@@ -361,6 +375,7 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
 		sugov_update_commit(sg_policy, time, next_f);
 	}
 
+unlock:
 	raw_spin_unlock(&sg_policy->update_lock);
 }
 
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 2473736c7616..d9c7c6887493 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1472,6 +1472,10 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 	 */
 	if (flags & DEQUEUE_SLEEP)
 		task_non_contending(p);
+
+	/* Clear cpufreq flags after last deadline task is dequeued */
+	if (!rq->dl.dl_nr_running)
+		cpufreq_update_util(rq, SCHED_CPUFREQ_DL_CLEAR);
 }
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2915c0d95107..492188c3ee2d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3033,7 +3033,7 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
 		 *
 		 * See cpu_util().
 		 */
-		cpufreq_update_util(rq, 0);
+		cpufreq_update_util(rq, SCHED_CPUFREQ_CFS);
 	}
 }
 
@@ -5214,7 +5214,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	 * passed.
 	 */
 	if (p->in_iowait)
-		cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
+		cpufreq_update_util(rq, SCHED_CPUFREQ_CFS | SCHED_CPUFREQ_IOWAIT);
 
 	for_each_sched_entity(se) {
 		if (se->on_rq)
@@ -5309,6 +5309,10 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		sub_nr_running(rq, 1);
 
 	hrtick_update(rq);
+
+	/* Clear cpufreq flags after last CFS task is dequeued */
+	if (!rq->cfs.nr_running)
+		cpufreq_update_util(rq, SCHED_CPUFREQ_CFS_CLEAR);
 }
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 862a513adca3..c9e8a8e5641b 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1337,6 +1337,10 @@ static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
 	dequeue_rt_entity(rt_se, flags);
 
 	dequeue_pushable_task(rq, p);
+
+	/* Clear cpufreq flags after last rt task is dequeued */
+	if (!rq->rt.rt_nr_running)
+		cpufreq_update_util(rq, SCHED_CPUFREQ_RT_CLEAR);
 }
 
 /*
Currently the schedutil governor overwrites the sg_cpu->flags field on every call to the utilization handler. It was pretty good as the initial implementation of utilization handlers, there are several drawbacks though.

The biggest drawback is that the sg_cpu->flags field doesn't always represent the correct type of tasks that are enqueued on a CPU's rq. For example, if a fair task is enqueued while a RT or DL task is running, we will overwrite the flags with value 0 and that may take the CPU to lower OPPs unintentionally. There can be other corner cases as well which we aren't aware of currently.

This patch changes the current implementation to keep track of all the task types that are currently enqueued to the CPUs rq. A new flag: CLEAR is introduced and is set by the scheduling classes when their last task is dequeued. When the CLEAR flag bit is set, the schedutil governor resets all the other flag bits that are present in the flags parameter. For now, the util update handlers return immediately if they were called to clear the flag.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 include/linux/sched/cpufreq.h    |  7 ++++++-
 kernel/sched/cpufreq_schedutil.c | 21 ++++++++++++++++++---
 kernel/sched/deadline.c          |  4 ++++
 kernel/sched/fair.c              |  8 ++++++--
 kernel/sched/rt.c                |  4 ++++
 5 files changed, 38 insertions(+), 6 deletions(-)

--
2.15.0.194.g9af6a3dea062
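To make the accumulate/clear semantics of the patch easy to see outside the kernel, here is a small self-contained user-space demo (illustrative only) that mimics what the governor does with its flags word; the flag values match the ones defined by the patch.

#include <assert.h>
#include <stdio.h>

#define SCHED_CPUFREQ_CLEAR	(1U << 31)
#define SCHED_CPUFREQ_RT	(1U << 0)
#define SCHED_CPUFREQ_DL	(1U << 1)
#define SCHED_CPUFREQ_CFS	(1U << 2)
#define SCHED_CPUFREQ_RT_CLEAR	(SCHED_CPUFREQ_RT | SCHED_CPUFREQ_CLEAR)
#define SCHED_CPUFREQ_CFS_CLEAR	(SCHED_CPUFREQ_CFS | SCHED_CPUFREQ_CLEAR)

static unsigned int sg_flags;	/* stands in for sg_cpu->flags */

static void update(unsigned int flags)
{
	if (flags & SCHED_CPUFREQ_CLEAR) {
		sg_flags &= ~flags;	/* drop only the bits being cleared */
		return;
	}
	sg_flags |= flags;		/* accumulate, never overwrite */
}

int main(void)
{
	update(SCHED_CPUFREQ_CFS);		/* fair task enqueued */
	update(SCHED_CPUFREQ_RT);		/* RT task enqueued on top */
	assert(sg_flags == (SCHED_CPUFREQ_CFS | SCHED_CPUFREQ_RT));

	update(SCHED_CPUFREQ_RT_CLEAR);		/* last RT task dequeued */
	assert(sg_flags == SCHED_CPUFREQ_CFS);	/* CFS info is not lost */

	update(SCHED_CPUFREQ_CFS_CLEAR);	/* last CFS task dequeued */
	printf("flags when entering idle: %#x\n", sg_flags);	/* prints 0 */
	return 0;
}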