Message ID | 20170706094948.8779-3-dietmar.eggemann@arm.com |
---|---|
State | New |
Headers | show |
Series | arm, arm64: frequency- and cpu-invariant accounting support for task scheduler | expand |
On 06-07-17, 10:49, Dietmar Eggemann wrote: > A frequency-invariant load-tracking solution based on cpufreq transition > notifier will not work for future fast frequency switching policies. > That is why a different solution is presented with this patch. > > Let cpufreq call the function arch_set_freq_scale() to pass the current > frequency, the max supported frequency and the cpumask of the related > cpus to a consumer (an arch) which defines arch_set_freq_scale(). > > The consumer has to associate arch_set_freq_scale with the name of its > own implementation foo_set_freq_scale() to overwrite the empty standard > definition in drivers/cpufreq/cpufreq.c. > An arch could do this in one of its arch-specific header files > (e.g. arch/$ARCH/include/asm/topology.h) which gets included in > drivers/cpufreq/cpufreq.c. > > In case arch_set_freq_scale() is not defined (and because of the > pr_debug() drivers/cpufreq/cpufreq.c is not compiled with -DDEBUG) The line within () needs to be improved to convey a clear message. > the > function cpufreq_set_freq_scale() gets compiled out. > > Cc: Rafael J. Wysocki <rjw@rjwysocki.net> > Cc: Viresh Kumar <viresh.kumar@linaro.org> > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> > --- > drivers/cpufreq/cpufreq.c | 26 ++++++++++++++++++++++++++ > 1 file changed, 26 insertions(+) > > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c > index 9bf97a366029..a04c5886a5ce 100644 > --- a/drivers/cpufreq/cpufreq.c > +++ b/drivers/cpufreq/cpufreq.c > @@ -347,6 +347,28 @@ static void __cpufreq_notify_transition(struct cpufreq_policy *policy, > } > } > > +/********************************************************************* > + * FREQUENCY INVARIANT CPU CAPACITY SUPPORT * > + *********************************************************************/ > + > +#ifndef arch_set_freq_scale > +static void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq, > + unsigned long max_freq) > +{} > +#endif > + > +static void cpufreq_set_freq_scale(struct cpufreq_policy *policy, > + struct cpufreq_freqs *freqs) > +{ > + unsigned long cur_freq = freqs ? freqs->new : policy->cur; > + unsigned long max_freq = policy->cpuinfo.max_freq; > + > + pr_debug("cpus %*pbl cur/cur max freq %lu/%lu kHz\n", > + cpumask_pr_args(policy->related_cpus), cur_freq, max_freq); > + > + arch_set_freq_scale(policy->related_cpus, cur_freq, max_freq); I am not sure why all these are required to be sent here and will come back to it later on after going through other patches. > +} > + > /** > * cpufreq_notify_transition - call notifier chain and adjust_jiffies > * on frequency transition. > @@ -405,6 +427,8 @@ void cpufreq_freq_transition_begin(struct cpufreq_policy *policy, > > spin_unlock(&policy->transition_lock); > > + cpufreq_set_freq_scale(policy, freqs); > + Why do this before even changing the frequency ? We may fail while changing it. IMHO, you should call this routine whenever we update policy->cur and that happens regularly in __cpufreq_notify_transition() and few other places.. > cpufreq_notify_transition(policy, freqs, CPUFREQ_PRECHANGE); > } > EXPORT_SYMBOL_GPL(cpufreq_freq_transition_begin); > @@ -2203,6 +2227,8 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy, > blocking_notifier_call_chain(&cpufreq_policy_notifier_list, > CPUFREQ_NOTIFY, new_policy); > > + cpufreq_set_freq_scale(new_policy, NULL); Why added it here ? To get it initialized ? If yes, then we should do that in cpufreq_online() where we first initialize policy->cur. Apart from this, you also need to update this in the schedutil governor (if you haven't done that in this series later) as that also updates policy->cur in the fast path. -- viresh
On Fri, Jul 7, 2017 at 6:01 PM, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: > On 06/07/17 11:40, Viresh Kumar wrote: >> On 06-07-17, 10:49, Dietmar Eggemann wrote: > > [...] > >>> In case arch_set_freq_scale() is not defined (and because of the >>> pr_debug() drivers/cpufreq/cpufreq.c is not compiled with -DDEBUG) >> >> The line within () needs to be improved to convey a clear message. > > Probably not needed anymore. See below. > > [...] > >>> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c >>> index 9bf97a366029..a04c5886a5ce 100644 >>> --- a/drivers/cpufreq/cpufreq.c >>> +++ b/drivers/cpufreq/cpufreq.c >>> @@ -347,6 +347,28 @@ static void __cpufreq_notify_transition(struct cpufreq_policy *policy, >>> } >>> } >>> >>> +/********************************************************************* >>> + * FREQUENCY INVARIANT CPU CAPACITY SUPPORT * >>> + *********************************************************************/ >>> + >>> +#ifndef arch_set_freq_scale >>> +static void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq, >>> + unsigned long max_freq) >>> +{} >>> +#endif >>> + >>> +static void cpufreq_set_freq_scale(struct cpufreq_policy *policy, >>> + struct cpufreq_freqs *freqs) >>> +{ >>> + unsigned long cur_freq = freqs ? freqs->new : policy->cur; >>> + unsigned long max_freq = policy->cpuinfo.max_freq; >>> + >>> + pr_debug("cpus %*pbl cur/cur max freq %lu/%lu kHz\n", >>> + cpumask_pr_args(policy->related_cpus), cur_freq, max_freq); >>> + >>> + arch_set_freq_scale(policy->related_cpus, cur_freq, max_freq); >> >> I am not sure why all these are required to be sent here and will come back to >> it later on after going through other patches. > > See below. > >>> +} >>> + >>> /** >>> * cpufreq_notify_transition - call notifier chain and adjust_jiffies >>> * on frequency transition. >>> @@ -405,6 +427,8 @@ void cpufreq_freq_transition_begin(struct cpufreq_policy *policy, >>> >>> spin_unlock(&policy->transition_lock); >>> >>> + cpufreq_set_freq_scale(policy, freqs); >>> + >> >> Why do this before even changing the frequency ? We may fail while changing it. >> >> IMHO, you should call this routine whenever we update policy->cur and that >> happens regularly in __cpufreq_notify_transition() and few other places.. > > See below. > >>> cpufreq_notify_transition(policy, freqs, CPUFREQ_PRECHANGE); >>> } >>> EXPORT_SYMBOL_GPL(cpufreq_freq_transition_begin); >>> @@ -2203,6 +2227,8 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy, >>> blocking_notifier_call_chain(&cpufreq_policy_notifier_list, >>> CPUFREQ_NOTIFY, new_policy); >>> >>> + cpufreq_set_freq_scale(new_policy, NULL); >> >> Why added it here ? To get it initialized ? If yes, then we should do that in >> cpufreq_online() where we first initialize policy->cur. > > I agree. This can go away. Initialization is not really needed here. We initialize > the scale values to SCHED_CAPACITY_SCALE at boot-time. > >> Apart from this, you also need to update this in the schedutil governor (if you >> haven't done that in this series later) as that also updates policy->cur in the >> fast path. > > So what about I call arch_set_freq_scale() in __cpufreq_notify_transition() in the > CPUFREQ_POSTCHANGE case for slow-switching and in cpufreq_driver_fast_switch() for > fast-switching? Why don't you do this in drivers instead of in the core? Ultimately, the driver knows what frequency it has requested, so why can't it call arch_set_freq_scale()? Thanks, Rafael
On 10-07-17, 13:02, Dietmar Eggemann wrote: > Yes, I will change this. The #define approach is not really necessary > here since we're not in the scheduler hot-path and inlining is not > really required here. It would be part of scheduler hot-path for the fast-switching case, isn't it ? (I am not arguing against using weak functions, just wanted to correct above statement). -- viresh
On Tuesday, July 11, 2017 04:06:01 PM Dietmar Eggemann wrote: > On 11/07/17 07:01, Viresh Kumar wrote: > > On 10-07-17, 13:02, Dietmar Eggemann wrote: > >> Yes, I will change this. The #define approach is not really necessary > >> here since we're not in the scheduler hot-path and inlining is not > >> really required here. > > > > It would be part of scheduler hot-path for the fast-switching case, isn't it ? > > (I am not arguing against using weak functions, just wanted to correct above > > statement). > > Yes you're right here. > > But in the meantime we're convinced that cpufreq_driver_fast_switch() is > not the right place to call arch_set_freq_scale() since for (future) > arm/arm64 fast-switch driver, the return value of > cpufreq_driver->fast_switch() does not give us the information that the > frequency value did actually change. > > So we probably have to do this soemwhere in the cpufreq driver(s) to > support fast-switching until we have aperf/mperf like counters. If that's the case, I'd say call arch_set_freq_scale() from drivers in all cases or it will get *really* confusing. Thanks, Rafael
On 11/07/17 07:01, Viresh Kumar wrote: > On 10-07-17, 13:02, Dietmar Eggemann wrote: >> Yes, I will change this. The #define approach is not really necessary >> here since we're not in the scheduler hot-path and inlining is not >> really required here. > > It would be part of scheduler hot-path for the fast-switching case, isn't it ? > (I am not arguing against using weak functions, just wanted to correct above > statement). Yes you're right here. But in the meantime we're convinced that cpufreq_driver_fast_switch() is not the right place to call arch_set_freq_scale() since for (future) arm/arm64 fast-switch driver, the return value of cpufreq_driver->fast_switch() does not give us the information that the frequency value did actually change. So we probably have to do this soemwhere in the cpufreq driver(s) to support fast-switching until we have aperf/mperf like counters.
On 11/07/17 15:59, Rafael J. Wysocki wrote: > On Tuesday, July 11, 2017 04:06:01 PM Dietmar Eggemann wrote: >> On 11/07/17 07:01, Viresh Kumar wrote: >>> On 10-07-17, 13:02, Dietmar Eggemann wrote: >>>> Yes, I will change this. The #define approach is not really necessary >>>> here since we're not in the scheduler hot-path and inlining is not >>>> really required here. >>> >>> It would be part of scheduler hot-path for the fast-switching case, isn't it ? >>> (I am not arguing against using weak functions, just wanted to correct above >>> statement). >> >> Yes you're right here. >> >> But in the meantime we're convinced that cpufreq_driver_fast_switch() is >> not the right place to call arch_set_freq_scale() since for (future) >> arm/arm64 fast-switch driver, the return value of >> cpufreq_driver->fast_switch() does not give us the information that the >> frequency value did actually change. >> >> So we probably have to do this soemwhere in the cpufreq driver(s) to >> support fast-switching until we have aperf/mperf like counters. > > If that's the case, I'd say call arch_set_freq_scale() from drivers in all > cases or it will get *really* confusing. Agreed, we should do it for slow-switching drivers from within the driver as well in this case.
On 11-07-17, 16:06, Dietmar Eggemann wrote: > But in the meantime we're convinced that cpufreq_driver_fast_switch() is > not the right place to call arch_set_freq_scale() since for (future) > arm/arm64 fast-switch driver, the return value of > cpufreq_driver->fast_switch() does not give us the information that the > frequency value did actually change. Yeah, I saw your discussion with Peter on #linux-rt IRC and TBH I wasn't aware that we are going to do fast switching that way. Just trying to get understanding of that idea a bit.. So we will do fast switching from scheduler's point of view, i.e. we wouldn't schedule a kthread to change the frequency. But the real hardware still can't do that without sleeping, like if we have I2C somewhere in between. AFAIU, we will still have some kind of *software* bottom half to do that work, isn't it? And it wouldn't be that we have pushed some instructions to the hardware, which it can do a bit later. For example, the regulator may be accessed via I2C and we need to program that before changing the clock. So, it will be done by some software code only. And now I am wondering on why that would be any better than the kthread in schedutil. Sorry, I haven't understood the idea completely yet :( -- viresh
On Wed, Jul 12, 2017 at 09:39:17AM +0530, Viresh Kumar wrote: > Yeah, I saw your discussion with Peter on #linux-rt IRC and TBH I wasn't aware > that we are going to do fast switching that way. Just trying to get > understanding of that idea a bit.. > > So we will do fast switching from scheduler's point of view, i.e. we wouldn't > schedule a kthread to change the frequency. But the real hardware still can't do > that without sleeping, like if we have I2C somewhere in between. AFAIU, we will > still have some kind of *software* bottom half to do that work, isn't it? And it > wouldn't be that we have pushed some instructions to the hardware, which it can > do a bit later. > > For example, the regulator may be accessed via I2C and we need to program that > before changing the clock. So, it will be done by some software code only. > > And now I am wondering on why that would be any better than the kthread in > schedutil. Sorry, I haven't understood the idea completely yet :( So the problem with the thread is two-fold; one the one hand we like the scheduler to directly set frequency, but then we need to schedule a task to change the frequency, which will change the frequency and around we go. On the other hand, there's very nasty issues with PI. This thread would have very high priority (otherwise the SCHED_DEADLINE stuff won't work) but that then means this thread needs to boost the owner of the i2c mutex. And that then creates a massive bandwidth accounting hole. The advantage of using an interrupt driven state machine is that all those issues go away. But yes, whichever way around you turn things, its crap. But given the hardware its the best we can do.
On 12/07/17 12:14, Peter Zijlstra wrote: > On Wed, Jul 12, 2017 at 02:57:55PM +0530, Viresh Kumar wrote: >> On 12-07-17, 10:31, Peter Zijlstra wrote: >>> So the problem with the thread is two-fold; one the one hand we like the >>> scheduler to directly set frequency, but then we need to schedule a task >>> to change the frequency, which will change the frequency and around we >>> go. >>> >>> On the other hand, there's very nasty issues with PI. This thread would >>> have very high priority (otherwise the SCHED_DEADLINE stuff won't work) >>> but that then means this thread needs to boost the owner of the i2c >>> mutex. And that then creates a massive bandwidth accounting hole. >>> >>> >>> The advantage of using an interrupt driven state machine is that all >>> those issues go away. >>> >>> But yes, whichever way around you turn things, its crap. But given the >>> hardware its the best we can do. >> >> Thanks for the explanation Peter. >> >> IIUC, it will take more time to change the frequency eventually with >> the interrupt-driven state machine as there may be multiple bottom >> halves involved here, for supply, clk, etc, which would run at normal >> priorities now. And those were boosted currently due to the high >> priority sugov thread. And we are fine with that (from performance >> point of view) ? > > I'm not sure what you mean; bottom halves as in softirq? From what I can > tell an i2c bus does clk_prepare_enable() on registration and from that > point on clk_enable() is usable from atomic contexts. But afaict clk > stuff doesn't do interrupts at all. > > (with a note that I absolutely hate the clk locking) > Agreed. Juri pointed out this as a blocker a while ago and when we started implementing the new and shiny ARM SCMI specification, I dropped the whole clock layer interaction for the CPUFreq driver. However, I still have to deal with some mailbox locking(still experimenting currently) > I think the interrupt driven thing can actually be faster than the > 'regular' task waiting on the mutex. The regulator message can be > locklessly queued (it only ever makes sense to have 1 such message > pending, any later one will invalidate a prior one). > Ah OK, I just asked the same in the other thread, you have already answered me. Good we can ignore. > Then the i2c interrupt can detect the availability of this pending > message and splice it into the transfer queue at an opportune moment. > > (of course, the current i2c bits don't support any of that) > >> Coming back to where we started from (where should we call >> arch_set_freq_scale() from ?). > > The drivers.. the core cpufreq doesn't know when (if any) transition is > completed. > >> I think we would still need some kind of synchronization between >> cpufreq core and the cpufreq drivers to make sure we don't start >> another freq change before the previous one is complete. Otherwise >> the cpufreq drivers would be required to have similar support with >> proper locking in place. > > Not sure what you mean; also not sure why. On x86 we never know, cannot > know. So why would this stuff be any different. > Good, I was under the same assumption that it's okay to override the old request with new. >> And if the core is going to get notified about successful freq changes >> (which it should IMHO), then it may still be better to call >> arch_set_freq_scale() from the core itself and not from individual >> drivers. > > I would not involve the core. All we want from the core is a unified > interface towards requesting DVFS changes. Everything that happens after > is not its business. > The question is whether we *need* to know the completion of frequency transition. What is the impact of absence of it ? I am considering platforms which may take up to a ms or more to do the actual transition in the firmware. -- Regards, Sudeep
On Thu, Jul 13, 2017 at 03:04:09PM +0100, Sudeep Holla wrote: > The question is whether we *need* to know the completion of frequency > transition. What is the impact of absence of it ? I am considering > platforms which may take up to a ms or more to do the actual transition > in the firmware. So on x86 we can recover from not knowing by means of the APERF/MPERF thing, which gives us the average effective frequency over the last period. If you lack that you need something to update the actual effective frequency. Changing the effective frequency at request time might confuse things -- esp. if the request might not be honoured at all or can take a significant time to complete. Not to mention that _IF_ you rely on the effective frequency to set other clocks things can come unstuck. So unless you go the whole distance and do APERF/MPERF like things, I think it would be very good to have a notification of completion (and possibly a read-back of the effective frequency that is now set).
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c index 9bf97a366029..a04c5886a5ce 100644 --- a/drivers/cpufreq/cpufreq.c +++ b/drivers/cpufreq/cpufreq.c @@ -347,6 +347,28 @@ static void __cpufreq_notify_transition(struct cpufreq_policy *policy, } } +/********************************************************************* + * FREQUENCY INVARIANT CPU CAPACITY SUPPORT * + *********************************************************************/ + +#ifndef arch_set_freq_scale +static void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq, + unsigned long max_freq) +{} +#endif + +static void cpufreq_set_freq_scale(struct cpufreq_policy *policy, + struct cpufreq_freqs *freqs) +{ + unsigned long cur_freq = freqs ? freqs->new : policy->cur; + unsigned long max_freq = policy->cpuinfo.max_freq; + + pr_debug("cpus %*pbl cur/cur max freq %lu/%lu kHz\n", + cpumask_pr_args(policy->related_cpus), cur_freq, max_freq); + + arch_set_freq_scale(policy->related_cpus, cur_freq, max_freq); +} + /** * cpufreq_notify_transition - call notifier chain and adjust_jiffies * on frequency transition. @@ -405,6 +427,8 @@ void cpufreq_freq_transition_begin(struct cpufreq_policy *policy, spin_unlock(&policy->transition_lock); + cpufreq_set_freq_scale(policy, freqs); + cpufreq_notify_transition(policy, freqs, CPUFREQ_PRECHANGE); } EXPORT_SYMBOL_GPL(cpufreq_freq_transition_begin); @@ -2203,6 +2227,8 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy, blocking_notifier_call_chain(&cpufreq_policy_notifier_list, CPUFREQ_NOTIFY, new_policy); + cpufreq_set_freq_scale(new_policy, NULL); + policy->min = new_policy->min; policy->max = new_policy->max;
A frequency-invariant load-tracking solution based on cpufreq transition notifier will not work for future fast frequency switching policies. That is why a different solution is presented with this patch. Let cpufreq call the function arch_set_freq_scale() to pass the current frequency, the max supported frequency and the cpumask of the related cpus to a consumer (an arch) which defines arch_set_freq_scale(). The consumer has to associate arch_set_freq_scale with the name of its own implementation foo_set_freq_scale() to overwrite the empty standard definition in drivers/cpufreq/cpufreq.c. An arch could do this in one of its arch-specific header files (e.g. arch/$ARCH/include/asm/topology.h) which gets included in drivers/cpufreq/cpufreq.c. In case arch_set_freq_scale() is not defined (and because of the pr_debug() drivers/cpufreq/cpufreq.c is not compiled with -DDEBUG) the function cpufreq_set_freq_scale() gets compiled out. Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> --- drivers/cpufreq/cpufreq.c | 26 ++++++++++++++++++++++++++ 1 file changed, 26 insertions(+) -- 2.11.0