Message ID | 1539102302-9057-1-git-send-email-thara.gopinath@linaro.org |
---|---|
Headers | show |
Series | Introduce thermal pressure | expand |
On Tue, Oct 09, 2018 at 12:24:55PM -0400, Thara Gopinath wrote: > Thermal governors can respond to an overheat event for a cpu by > capping the cpu's maximum possible frequency. This in turn > means that the maximum available compute capacity of the > cpu is restricted. But today in linux kernel, in event of maximum > frequency capping of a cpu, the maximum available compute > capacity of the cpu is not adjusted at all. In other words, scheduler > is unware maximum cpu capacity restrictions placed due to thermal > activity. Interesting, I would have sworn that I tested this years ago by lowering the maximum frequency of a cpufreq domain, and the scheduler reacted accordingly to the new maximum capacities of the cpus. > This patch series attempts to address this issue. > The benefits identified are better task placement among available > cpus in event of overheating which in turn leads to better > performance numbers. > > The delta between the maximum possible capacity of a cpu and > maximum available capacity of a cpu due to thermal event can > be considered as thermal pressure. Instantaneous thermal pressure > is hard to record and can sometime be erroneous as there can be mismatch > between the actual capping of capacity and scheduler recording it. > Thus solution is to have a weighted average per cpu value for thermal > pressure over time. The weight reflects the amount of time the cpu has > spent at a capped maximum frequency. To accumulate, average and > appropriately decay thermal pressure, this patch series uses pelt > signals and reuses the available framework that does a similar > bookkeeping of rt/dl task utilization. > > Regarding testing, basic build, boot and sanity testing have been > performed on hikey960 mainline kernel with debian file system. > Further aobench (An occlusion renderer for benchmarking realworld > floating point performance) showed the following results on hikey960 > with debain. > > Result Standard Standard > (Time secs) Error Deviation > Hikey 960 - no thermal pressure applied 138.67 6.52 11.52% > Hikey 960 - thermal pressure applied 122.37 5.78 11.57% > > Thara Gopinath (7): > sched/pelt: Add option to make load and util calculations frequency > invariant > sched/pelt.c: Add support to track thermal pressure > sched: Add infrastructure to store and update instantaneous thermal > pressure > sched: Initialize per cpu thermal pressure structure > sched/fair: Enable CFS periodic tick to update thermal pressure > sched/fair: update cpu_capcity to reflect thermal pressure > thermal/cpu-cooling: Update thermal pressure in case of a maximum > frequency capping > > drivers/base/arch_topology.c | 1 + > drivers/thermal/cpu_cooling.c | 20 ++++++++++++- thermal? There are other ways in which the maximum frequency of a cpu can be limited, for example from userspace via scaling_max_freq. When something (anything) changes the maximum frequency of a cpufreq policy, the scheduler should be notified. I think this change should be done in cpufreq instead to make it generic and not particular to a given maximum frequency "capper". Cheers, Javi
On Tue, Oct 09, 2018 at 12:25:01PM -0400, Thara Gopinath wrote: > cpu_capacity relflects the maximum available capacity of a cpu. Thermal > pressure on a cpu means this maximum available capacity is reduced. This > patch reduces the average thermal pressure for a cpu from its maximum > available capacity so that cpu_capacity reflects the actual > available capacity. > > Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org> > --- > kernel/sched/fair.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 7deb1d0..8651e55 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -7497,6 +7497,7 @@ static unsigned long scale_rt_capacity(int cpu) > > used = READ_ONCE(rq->avg_rt.util_avg); > used += READ_ONCE(rq->avg_dl.util_avg); > + used += READ_ONCE(rq->avg_thermal.load_avg); IIUIC, you are treating thermal pressure as an artificial load on the cpu. If so, this sounds like a hard to maintain hack. Thermal pressure have different characteristics to utilization. What happens if thermal sets the cpu cooling state back to 0 because there is thermal headroom again? Do we keep adding this artificial load to the cpu just because there was thermal pressure in the past and let it decay as if it was cpu load? Cheers, Javi
* Thara Gopinath <thara.gopinath@linaro.org> wrote: > Thermal governors can respond to an overheat event for a cpu by > capping the cpu's maximum possible frequency. This in turn > means that the maximum available compute capacity of the > cpu is restricted. But today in linux kernel, in event of maximum > frequency capping of a cpu, the maximum available compute > capacity of the cpu is not adjusted at all. In other words, scheduler > is unware maximum cpu capacity restrictions placed due to thermal > activity. This patch series attempts to address this issue. > The benefits identified are better task placement among available > cpus in event of overheating which in turn leads to better > performance numbers. > > The delta between the maximum possible capacity of a cpu and > maximum available capacity of a cpu due to thermal event can > be considered as thermal pressure. Instantaneous thermal pressure > is hard to record and can sometime be erroneous as there can be mismatch > between the actual capping of capacity and scheduler recording it. > Thus solution is to have a weighted average per cpu value for thermal > pressure over time. The weight reflects the amount of time the cpu has > spent at a capped maximum frequency. To accumulate, average and > appropriately decay thermal pressure, this patch series uses pelt > signals and reuses the available framework that does a similar > bookkeeping of rt/dl task utilization. > > Regarding testing, basic build, boot and sanity testing have been > performed on hikey960 mainline kernel with debian file system. > Further aobench (An occlusion renderer for benchmarking realworld > floating point performance) showed the following results on hikey960 > with debain. > > Result Standard Standard > (Time secs) Error Deviation > Hikey 960 - no thermal pressure applied 138.67 6.52 11.52% > Hikey 960 - thermal pressure applied 122.37 5.78 11.57% Wow, +13% speedup, impressive! We definitely want this outcome. I'm wondering what happens if we do not track and decay the thermal load at all at the PELT level, but instantaneously decrease/increase effective CPU capacity in reaction to thermal events we receive from the CPU. You describe the averaging as: > Instantaneous thermal pressure is hard to record and can sometime be erroneous as there can > be mismatch between the actual capping of capacity and scheduler recording it. Not sure I follow the argument here: are there bogus thermal throttling events? If so then they are hopefully not frequent enough and should average out over time even if we follow it instantly. I.e. what is 'can sometimes be erroneous', exactly? Thanks, Ingo
On Wed, 10 Oct 2018 at 10:29, Quentin Perret <quentin.perret@arm.com> wrote: > > Hi Thara, > > On Wednesday 10 Oct 2018 at 08:17:51 (+0200), Ingo Molnar wrote: > > > > * Thara Gopinath <thara.gopinath@linaro.org> wrote: > > > > > Thermal governors can respond to an overheat event for a cpu by > > > capping the cpu's maximum possible frequency. This in turn > > > means that the maximum available compute capacity of the > > > cpu is restricted. But today in linux kernel, in event of maximum > > > frequency capping of a cpu, the maximum available compute > > > capacity of the cpu is not adjusted at all. In other words, scheduler > > > is unware maximum cpu capacity restrictions placed due to thermal > > > activity. This patch series attempts to address this issue. > > > The benefits identified are better task placement among available > > > cpus in event of overheating which in turn leads to better > > > performance numbers. > > > > > > The delta between the maximum possible capacity of a cpu and > > > maximum available capacity of a cpu due to thermal event can > > > be considered as thermal pressure. Instantaneous thermal pressure > > > is hard to record and can sometime be erroneous as there can be mismatch > > > between the actual capping of capacity and scheduler recording it. > > > Thus solution is to have a weighted average per cpu value for thermal > > > pressure over time. The weight reflects the amount of time the cpu has > > > spent at a capped maximum frequency. To accumulate, average and > > > appropriately decay thermal pressure, this patch series uses pelt > > > signals and reuses the available framework that does a similar > > > bookkeeping of rt/dl task utilization. > > > > > > Regarding testing, basic build, boot and sanity testing have been > > > performed on hikey960 mainline kernel with debian file system. > > > Further aobench (An occlusion renderer for benchmarking realworld > > > floating point performance) showed the following results on hikey960 > > > with debain. > > > > > > Result Standard Standard > > > (Time secs) Error Deviation > > > Hikey 960 - no thermal pressure applied 138.67 6.52 11.52% > > > Hikey 960 - thermal pressure applied 122.37 5.78 11.57% > > > > Wow, +13% speedup, impressive! We definitely want this outcome. > > > > I'm wondering what happens if we do not track and decay the thermal load at all at the PELT > > level, but instantaneously decrease/increase effective CPU capacity in reaction to thermal > > events we receive from the CPU. > > +1, it's not that obvious (to me at least) that averaging the thermal > pressure over time is necessarily what we want. Say the thermal governor > caps a CPU and 'removes' 70% of its capacity, it will take forever for > the PELT signal to ramp-up to that level before the scheduler can react. > And the other way around, if you release the cap, it'll take a while > before we actually start using the newly available capacity. I can also > imagine how reacting too fast can be counter-productive, but I guess > having numbers and/or use-cases to show that would be great :-) The problem with reflecting directly the capping is that it happens far more often than the pace at which cpu_capacity_orig is updated in the scheduler. This means that at the moment when scheduler uses the value, it might not be correct anymore. Then, this value are also used when building the sched_domain and setting max_cpu_capacity which would also implies the rebuilt the sched_domain topology ... The pace of changing the capping is to fast to reflect that in the whole scheduler topology > > Thara, have you tried to experiment with a simpler implementation as > suggested by Ingo ? > > Also, assuming that we do want to average things, do we actually want to > tie the thermal ramp-up time to the PELT half life ? That provides > nice maths properties wrt the other signals, but it's not obvious to me > that this thermal 'constant' should be the same on all platforms. Or > maybe it should ? The main interest of using PELT signal is that thermal pressure will evolve at the same pace as other signals used in the scheduler. With thermal pressure, we have the exact same problem as with RT tasks. The thermal will cap the max frequency which will cap the utilization of the tasks running on the CPU > > Thanks, > Quentin > > > > > You describe the averaging as: > > > > > Instantaneous thermal pressure is hard to record and can sometime be erroneous as there can > > > be mismatch between the actual capping of capacity and scheduler recording it. > > > > Not sure I follow the argument here: are there bogus thermal throttling events? If so then > > they are hopefully not frequent enough and should average out over time even if we follow > > it instantly. > > > > I.e. what is 'can sometimes be erroneous', exactly? > > > > Thanks, > > > > Ingo
Hi Vincent, On Wednesday 10 Oct 2018 at 10:50:05 (+0200), Vincent Guittot wrote: > The problem with reflecting directly the capping is that it happens > far more often than the pace at which cpu_capacity_orig is updated in > the scheduler. Hmm, how can you be so sure ? That most likely depends on the workload, the platform and the thermal governor. Some platforms heat up slowly, some quickly. The pace at which the thermal governor will change things should depend on that I assume. > This means that at the moment when scheduler uses the > value, it might not be correct anymore. And OTOH, when you remove a cap for example, it will take time before the scheduler can see the newly available capacity if you need to wait for the signal to decay. So you are using a wrong information too in that scenario. > Then, this value are also used > when building the sched_domain and setting max_cpu_capacity which > would also implies the rebuilt the sched_domain topology ... Wait what ? I thought the thermal cap was reflected in capacity_of, not capacity_orig_of ... You need to rebuild the sched_domain in case of thermal pressure ? Hmm, let me have a closer look at the patches, I must have missed something ... > The pace of changing the capping is to fast to reflect that in the > whole scheduler topology That's probably true in some cases, but it'd be cool to have numbers to back up that statement, I think. Now, if you do need to rebuild the sched domain topology every time you update the thermal pressure, I think the PELT HL is _way_ too short for that ... You can't rebuild the whole thing every 32ms or so. Or am I misunderstanding something ? > > Thara, have you tried to experiment with a simpler implementation as > > suggested by Ingo ? > > > > Also, assuming that we do want to average things, do we actually want to > > tie the thermal ramp-up time to the PELT half life ? That provides > > nice maths properties wrt the other signals, but it's not obvious to me > > that this thermal 'constant' should be the same on all platforms. Or > > maybe it should ? > > The main interest of using PELT signal is that thermal pressure will > evolve at the same pace as other signals used in the scheduler. Right, I think this is a nice property too (assuming that we actually want to average things out). > With > thermal pressure, we have the exact same problem as with RT tasks. The > thermal will cap the max frequency which will cap the utilization of > the tasks running on the CPU Well, the nature of the signal is slightly different IMO. Yes it's capacity, but you're no actually measuring time spent on the CPU. All other PELT signals are based on time, this thermal thing isn't, so it is kinda different in a way. And I'm still wondering if it could be helpful to be able to have a different HL for that thermal signal. That would 'break' the nice maths properties we have, yes, but is it a problem or is it actually helpful to cope with the thermal characteristics of different platforms ? Thanks, Quentin
On Wed, 10 Oct 2018 at 11:55, Quentin Perret <quentin.perret@arm.com> wrote: > > Hi Vincent, > > On Wednesday 10 Oct 2018 at 10:50:05 (+0200), Vincent Guittot wrote: > > The problem with reflecting directly the capping is that it happens > > far more often than the pace at which cpu_capacity_orig is updated in > > the scheduler. > > Hmm, how can you be so sure ? That most likely depends on the workload, > the platform and the thermal governor. Some platforms heat up slowly, > some quickly. The pace at which the thermal governor will change things > should depend on that I assume. > > > This means that at the moment when scheduler uses the > > value, it might not be correct anymore. > > And OTOH, when you remove a cap for example, it will take time before > the scheduler can see the newly available capacity if you need to wait > for the signal to decay. So you are using a wrong information too in > that scenario. But we stay consistant with all other metrics. The same happen when a task decide to stay idle for a long time after some activity... You will have to wait for the signal to decay > > > Then, this value are also used > > when building the sched_domain and setting max_cpu_capacity which > > would also implies the rebuilt the sched_domain topology ... > > Wait what ? I thought the thermal cap was reflected in capacity_of, not > capacity_orig_of ... You need to rebuild the sched_domain in case of > thermal pressure ? > > Hmm, let me have a closer look at the patches, I must have missed > something ... > > > The pace of changing the capping is to fast to reflect that in the > > whole scheduler topology > > That's probably true in some cases, but it'd be cool to have numbers to > back up that statement, I think. > > Now, if you do need to rebuild the sched domain topology every time you > update the thermal pressure, I think the PELT HL is _way_ too short for > that ... You can't rebuild the whole thing every 32ms or so. Or am I > misunderstanding something ? > > > > Thara, have you tried to experiment with a simpler implementation as > > > suggested by Ingo ? > > > > > > Also, assuming that we do want to average things, do we actually want to > > > tie the thermal ramp-up time to the PELT half life ? That provides > > > nice maths properties wrt the other signals, but it's not obvious to me > > > that this thermal 'constant' should be the same on all platforms. Or > > > maybe it should ? > > > > The main interest of using PELT signal is that thermal pressure will > > evolve at the same pace as other signals used in the scheduler. > > Right, I think this is a nice property too (assuming that we actually > want to average things out). > > > With > > thermal pressure, we have the exact same problem as with RT tasks. The > > thermal will cap the max frequency which will cap the utilization of > > the tasks running on the CPU > > Well, the nature of the signal is slightly different IMO. Yes it's > capacity, but you're no actually measuring time spent on the CPU. All > other PELT signals are based on time, this thermal thing isn't, so it is > kinda different in a way. And I'm still wondering if it could be helpful hmmm ... it is based on time too. Both signals (current ones and thermal one) are really close. The main difference with other utilization signal is that instead of providing a running/not running boolean that is then weighted by the current capacity, the signal uses direclty the capped max capacity to reflect the amount of cycle that is stolen by thermal mitigation. > to be able to have a different HL for that thermal signal. That would > 'break' the nice maths properties we have, yes, but is it a problem or is > it actually helpful to cope with the thermal characteristics of > different platforms ? If you don't use the sign kind of signal with the same responsiveness, you will start to see some OPP toggles as an example when the thermal state change because one metrics will change faster than the other one and you can't have a correct view of the system. Same problem was happening with rt task. I take the example of RT task because it quite similar in the effect except that RT task steal all cycles when it runs whereas thermal mitigation steal only some by capping the frequency > > Thanks, > Quentin
On Wed, 10 Oct 2018 at 12:36, Quentin Perret <quentin.perret@arm.com> wrote: > > On Wednesday 10 Oct 2018 at 12:14:32 (+0200), Vincent Guittot wrote: > > On Wed, 10 Oct 2018 at 11:55, Quentin Perret <quentin.perret@arm.com> wrote: > > > > > > Hi Vincent, > > > > > > On Wednesday 10 Oct 2018 at 10:50:05 (+0200), Vincent Guittot wrote: > > > > The problem with reflecting directly the capping is that it happens > > > > far more often than the pace at which cpu_capacity_orig is updated in > > > > the scheduler. > > > > > > Hmm, how can you be so sure ? That most likely depends on the workload, > > > the platform and the thermal governor. Some platforms heat up slowly, > > > some quickly. The pace at which the thermal governor will change things > > > should depend on that I assume. > > > > > > > This means that at the moment when scheduler uses the > > > > value, it might not be correct anymore. > > > > > > And OTOH, when you remove a cap for example, it will take time before > > > the scheduler can see the newly available capacity if you need to wait > > > for the signal to decay. So you are using a wrong information too in > > > that scenario. > > > > But we stay consistant with all other metrics. The same happen when a > > task decide to stay idle for a long time after some activity... You > > will have to wait for the signal to decay > > Because you see that as a task going idle. You could also say that > removing the cap from a CPU is equivalent to migrating a task off that > CPU. In this case you should see the newly available cap pretty much > immediately. > > Also, I feel like the whole thermal capping story could interest the > Intel folks who, IIRC, were discussing how to represent their 'boosted' > OPPs in the scheduler some time ago. You can see the Intel boost thing > as a cap I think -- some OPPs can be inaccessible in some case. I'd be > interested to see what's their take on this. > > > > > Then, this value are also used > > > > when building the sched_domain and setting max_cpu_capacity which > > > > would also implies the rebuilt the sched_domain topology ... > > > > > > Wait what ? I thought the thermal cap was reflected in capacity_of, not > > > capacity_orig_of ... You need to rebuild the sched_domain in case of > > > thermal pressure ? > > I can't locate where you need to do this in the series. Do you actually > need to rebuild the sd hierarchy ? This patchset doesn't touch cpu_capacity_orig and doesn't need to as it assume that the max capacity is unchanged but some capacity is momentary stolen by thermal. If you want to reflect immediately all thermal capping change, you have to update this field and all related fields and struct around > > > > Hmm, let me have a closer look at the patches, I must have missed > > > something ... > > > > > > > The pace of changing the capping is to fast to reflect that in the > > > > whole scheduler topology > > > > > > That's probably true in some cases, but it'd be cool to have numbers to > > > back up that statement, I think. > > > > > > Now, if you do need to rebuild the sched domain topology every time you > > > update the thermal pressure, I think the PELT HL is _way_ too short for > > > that ... You can't rebuild the whole thing every 32ms or so. Or am I > > > misunderstanding something ? > > > > > > > > Thara, have you tried to experiment with a simpler implementation as > > > > > suggested by Ingo ? > > > > > > > > > > Also, assuming that we do want to average things, do we actually want to > > > > > tie the thermal ramp-up time to the PELT half life ? That provides > > > > > nice maths properties wrt the other signals, but it's not obvious to me > > > > > that this thermal 'constant' should be the same on all platforms. Or > > > > > maybe it should ? > > > > > > > > The main interest of using PELT signal is that thermal pressure will > > > > evolve at the same pace as other signals used in the scheduler. > > > > > > Right, I think this is a nice property too (assuming that we actually > > > want to average things out). > > > > > > > With > > > > thermal pressure, we have the exact same problem as with RT tasks. The > > > > thermal will cap the max frequency which will cap the utilization of > > > > the tasks running on the CPU > > > > > > Well, the nature of the signal is slightly different IMO. Yes it's > > > capacity, but you're no actually measuring time spent on the CPU. All > > > other PELT signals are based on time, this thermal thing isn't, so it is > > > kinda different in a way. And I'm still wondering if it could be helpful > > > > hmmm ... it is based on time too. > > You're not actually measuring the time spent on the CPU by the 'thermal > task'. There is no such thing as a 'thermal task'. You're just trying to > model things like that, but the thermal stuff isn't actually > interrupting running tasks to eat CPU cycles. It just makes thing run > slower, which isn't exactly the same thing IMO. > > But maybe that's a detail. > > > Both signals (current ones and thermal one) are really close. The main > > difference with other utilization signal is that instead of providing > > a running/not running boolean that is then weighted by the current > > capacity, the signal uses direclty the capped max capacity to reflect > > the amount of cycle that is stolen by thermal mitigation. > > > > > to be able to have a different HL for that thermal signal. That would > > > 'break' the nice maths properties we have, yes, but is it a problem or is > > > it actually helpful to cope with the thermal characteristics of > > > different platforms ? > > > > If you don't use the sign kind of signal with the same responsiveness, > > you will start to see some OPP toggles as an example when the thermal > > state change because one metrics will change faster than the other one > > and you can't have a correct view of the system. Same problem was > > happening with rt task. > > Well, that wasn't the problem with rt tasks. The problem with RT tasks > was that the time they spend on the CPU wasn't accounted _at all_ when > selecting frequency for CFS, not that the accounting was at a different > pace ... The problem was the same with RT, the cfs utilization was lower than reality because RT steals soem cycle to CFS So schedutil was selecting a lower frequency when cfs was running whereas the CPU was fully used. The same can happen with thermal: cap the max freq because of thermal the utilization with decrease. remove the cap the utilization is still low and you will select a low OPP because you don't take into account cycle stolen by thermal like with RT Regards, Vincent > > Thanks, > Quentin
On 10/10/18 14:04, Vincent Guittot wrote: [...] > The problem was the same with RT, the cfs utilization was lower than > reality because RT steals soem cycle to CFS > So schedutil was selecting a lower frequency when cfs was running > whereas the CPU was fully used. > The same can happen with thermal: > cap the max freq because of thermal > the utilization with decrease. > remove the cap > the utilization is still low and you will select a low OPP because you > don't take into account cycle stolen by thermal like with RT What if we scale frequency component considering the capped temporary max?
On 10/10/18 14:34, Vincent Guittot wrote: > Hi Juri, > > On Wed, 10 Oct 2018 at 14:23, Juri Lelli <juri.lelli@gmail.com> wrote: > > > > On 10/10/18 14:04, Vincent Guittot wrote: > > > > [...] > > > > > The problem was the same with RT, the cfs utilization was lower than > > > reality because RT steals soem cycle to CFS > > > So schedutil was selecting a lower frequency when cfs was running > > > whereas the CPU was fully used. > > > The same can happen with thermal: > > > cap the max freq because of thermal > > > the utilization with decrease. > > > remove the cap > > > the utilization is still low and you will select a low OPP because you > > > don't take into account cycle stolen by thermal like with RT > > > > What if we scale frequency component considering the capped temporary > > max? > > Do you mean using a kind of scale_thermal_capacity in accumulate_sum > when computing utilization ? Yeah, something like that I guess. So that we account for temporary "fake" 1024..
On Wednesday 10 Oct 2018 at 14:04:40 (+0200), Vincent Guittot wrote: > This patchset doesn't touch cpu_capacity_orig and doesn't need to as > it assume that the max capacity is unchanged but some capacity is > momentary stolen by thermal. > If you want to reflect immediately all thermal capping change, you > have to update this field and all related fields and struct around I don't follow you here. I never said I wanted to change cpu_capacity_orig. I don't think we should do that actually. Changing capacity_of (which is updated during LB IIRC) is just fine. The question is about what you want to do there: reflect an averaged value or the instantaneous one. It's not obvious (to me) that the complex one (the averaged value) is better than the other, simpler, one. All I'm saying from the beginning is that it'd be nice to have numbers and use cases to discuss the pros and cons of both approaches. > > > > Hmm, let me have a closer look at the patches, I must have missed > > > > something ... > > > > > > > > > The pace of changing the capping is to fast to reflect that in the > > > > > whole scheduler topology > > > > > > > > That's probably true in some cases, but it'd be cool to have numbers to > > > > back up that statement, I think. > > > > > > > > Now, if you do need to rebuild the sched domain topology every time you > > > > update the thermal pressure, I think the PELT HL is _way_ too short for > > > > that ... You can't rebuild the whole thing every 32ms or so. Or am I > > > > misunderstanding something ? > > > > > > > > > > Thara, have you tried to experiment with a simpler implementation as > > > > > > suggested by Ingo ? > > > > > > > > > > > > Also, assuming that we do want to average things, do we actually want to > > > > > > tie the thermal ramp-up time to the PELT half life ? That provides > > > > > > nice maths properties wrt the other signals, but it's not obvious to me > > > > > > that this thermal 'constant' should be the same on all platforms. Or > > > > > > maybe it should ? > > > > > > > > > > The main interest of using PELT signal is that thermal pressure will > > > > > evolve at the same pace as other signals used in the scheduler. > > > > > > > > Right, I think this is a nice property too (assuming that we actually > > > > want to average things out). > > > > > > > > > With > > > > > thermal pressure, we have the exact same problem as with RT tasks. The > > > > > thermal will cap the max frequency which will cap the utilization of > > > > > the tasks running on the CPU > > > > > > > > Well, the nature of the signal is slightly different IMO. Yes it's > > > > capacity, but you're no actually measuring time spent on the CPU. All > > > > other PELT signals are based on time, this thermal thing isn't, so it is > > > > kinda different in a way. And I'm still wondering if it could be helpful > > > > > > hmmm ... it is based on time too. > > > > You're not actually measuring the time spent on the CPU by the 'thermal > > task'. There is no such thing as a 'thermal task'. You're just trying to > > model things like that, but the thermal stuff isn't actually > > interrupting running tasks to eat CPU cycles. It just makes thing run > > slower, which isn't exactly the same thing IMO. > > > > But maybe that's a detail. > > > > > Both signals (current ones and thermal one) are really close. The main > > > difference with other utilization signal is that instead of providing > > > a running/not running boolean that is then weighted by the current > > > capacity, the signal uses direclty the capped max capacity to reflect > > > the amount of cycle that is stolen by thermal mitigation. > > > > > > > to be able to have a different HL for that thermal signal. That would > > > > 'break' the nice maths properties we have, yes, but is it a problem or is > > > > it actually helpful to cope with the thermal characteristics of > > > > different platforms ? > > > > > > If you don't use the sign kind of signal with the same responsiveness, > > > you will start to see some OPP toggles as an example when the thermal > > > state change because one metrics will change faster than the other one > > > and you can't have a correct view of the system. Same problem was > > > happening with rt task. > > > > Well, that wasn't the problem with rt tasks. The problem with RT tasks > > was that the time they spend on the CPU wasn't accounted _at all_ when > > selecting frequency for CFS, not that the accounting was at a different > > pace ... > > The problem was the same with RT, the cfs utilization was lower than > reality because RT steals soem cycle to CFS > So schedutil was selecting a lower frequency when cfs was running > whereas the CPU was fully used. > The same can happen with thermal: > cap the max freq because of thermal > the utilization with decrease. > remove the cap > the utilization is still low and you will select a low OPP because you > don't take into account cycle stolen by thermal like with RT I'm not arguing with the fact that we need to reflect the thermal pressure in the scheduler to see that a CPU is fully busy. I agree with that. I'm saying you don't necessarily have to update the thermal stuff and the existing PELT signals *at the same pace*, because different platforms have different thermal characteristics. Thanks,* Quentin
On Wed, 10 Oct 2018 at 14:50, Juri Lelli <juri.lelli@gmail.com> wrote: > > On 10/10/18 14:34, Vincent Guittot wrote: > > Hi Juri, > > > > On Wed, 10 Oct 2018 at 14:23, Juri Lelli <juri.lelli@gmail.com> wrote: > > > > > > On 10/10/18 14:04, Vincent Guittot wrote: > > > > > > [...] > > > > > > > The problem was the same with RT, the cfs utilization was lower than > > > > reality because RT steals soem cycle to CFS > > > > So schedutil was selecting a lower frequency when cfs was running > > > > whereas the CPU was fully used. > > > > The same can happen with thermal: > > > > cap the max freq because of thermal > > > > the utilization with decrease. > > > > remove the cap > > > > the utilization is still low and you will select a low OPP because you > > > > don't take into account cycle stolen by thermal like with RT > > > > > > What if we scale frequency component considering the capped temporary > > > max? > > > > Do you mean using a kind of scale_thermal_capacity in accumulate_sum > > when computing utilization ? > > Yeah, something like that I guess. So that we account for temporary > "fake" 1024.. But the utilization will not be invariant anymore across the system
On Wednesday 10 Oct 2018 at 14:50:33 (+0200), Juri Lelli wrote: > On 10/10/18 14:34, Vincent Guittot wrote: > > Hi Juri, > > > > On Wed, 10 Oct 2018 at 14:23, Juri Lelli <juri.lelli@gmail.com> wrote: > > > > > > On 10/10/18 14:04, Vincent Guittot wrote: > > > > > > [...] > > > > > > > The problem was the same with RT, the cfs utilization was lower than > > > > reality because RT steals soem cycle to CFS > > > > So schedutil was selecting a lower frequency when cfs was running > > > > whereas the CPU was fully used. > > > > The same can happen with thermal: > > > > cap the max freq because of thermal > > > > the utilization with decrease. > > > > remove the cap > > > > the utilization is still low and you will select a low OPP because you > > > > don't take into account cycle stolen by thermal like with RT > > > > > > What if we scale frequency component considering the capped temporary > > > max? > > > > Do you mean using a kind of scale_thermal_capacity in accumulate_sum > > when computing utilization ? > > Yeah, something like that I guess. So that we account for temporary > "fake" 1024.. But wouldn't that break frequency invariance ? A task would look bigger on a capped CPU than a non-capped one no ?
On Wed, 10 Oct 2018 at 15:05, Quentin Perret <quentin.perret@arm.com> wrote: > > On Wednesday 10 Oct 2018 at 14:04:40 (+0200), Vincent Guittot wrote: > > This patchset doesn't touch cpu_capacity_orig and doesn't need to as > > it assume that the max capacity is unchanged but some capacity is > > momentary stolen by thermal. > > If you want to reflect immediately all thermal capping change, you > > have to update this field and all related fields and struct around > > I don't follow you here. I never said I wanted to change > cpu_capacity_orig. I don't think we should do that actually. Changing > capacity_of (which is updated during LB IIRC) is just fine. The question > is about what you want to do there: reflect an averaged value or the > instantaneous one. Sorry I though your were speaking about updating this cpu_capacity_orig. With using instantaneous max value in capacity_of(), we are back to the problem raised by Thara that the value will most probably not reflect the current capping value when it is used in LB, because LB period can quite long on busy CPU (default max value is 32*sd_weight ms) > > It's not obvious (to me) that the complex one (the averaged value) is > better than the other, simpler, one. All I'm saying from the beginning > is that it'd be nice to have numbers and use cases to discuss the pros > and cons of both approaches. > > > > > > Hmm, let me have a closer look at the patches, I must have missed > > > > > something ... > > > > > > > > > > > The pace of changing the capping is to fast to reflect that in the > > > > > > whole scheduler topology > > > > > [snip] > > > > > > Well, that wasn't the problem with rt tasks. The problem with RT tasks > > > was that the time they spend on the CPU wasn't accounted _at all_ when > > > selecting frequency for CFS, not that the accounting was at a different > > > pace ... > > > > The problem was the same with RT, the cfs utilization was lower than > > reality because RT steals soem cycle to CFS > > So schedutil was selecting a lower frequency when cfs was running > > whereas the CPU was fully used. > > The same can happen with thermal: > > cap the max freq because of thermal > > the utilization with decrease. > > remove the cap > > the utilization is still low and you will select a low OPP because you > > don't take into account cycle stolen by thermal like with RT > > I'm not arguing with the fact that we need to reflect the thermal > pressure in the scheduler to see that a CPU is fully busy. I agree with > that. > > I'm saying you don't necessarily have to update the thermal stuff and > the existing PELT signals *at the same pace*, because different > platforms have different thermal characteristics. But you also need to take into account how fast other metrics in the scheduler are updated otherwise a metric will reflect a change not already reflected in the other metrics and you might take wrong decision as my example above where utilization is still low but thermal pressure is nul and you assume that you have lot of spare capacity Having metrics that use same responsiveness and are synced, help to get a consolidated view of the system. Vincent > > Thanks,* > Quentin
On 10/10/18 15:08, Vincent Guittot wrote: > On Wed, 10 Oct 2018 at 14:50, Juri Lelli <juri.lelli@gmail.com> wrote: > > > > On 10/10/18 14:34, Vincent Guittot wrote: > > > Hi Juri, > > > > > > On Wed, 10 Oct 2018 at 14:23, Juri Lelli <juri.lelli@gmail.com> wrote: > > > > > > > > On 10/10/18 14:04, Vincent Guittot wrote: > > > > > > > > [...] > > > > > > > > > The problem was the same with RT, the cfs utilization was lower than > > > > > reality because RT steals soem cycle to CFS > > > > > So schedutil was selecting a lower frequency when cfs was running > > > > > whereas the CPU was fully used. > > > > > The same can happen with thermal: > > > > > cap the max freq because of thermal > > > > > the utilization with decrease. > > > > > remove the cap > > > > > the utilization is still low and you will select a low OPP because you > > > > > don't take into account cycle stolen by thermal like with RT > > > > > > > > What if we scale frequency component considering the capped temporary > > > > max? > > > > > > Do you mean using a kind of scale_thermal_capacity in accumulate_sum > > > when computing utilization ? > > > > Yeah, something like that I guess. So that we account for temporary > > "fake" 1024.. > > But the utilization will not be invariant anymore across the system Mmm, I guess I might be wrong, but I was thinking we should be able to deal with this similarly to what we do with cpus with different max capacities. So, another factor? Because then, how do we handle other ways in which max freq can be restricted (e.g. from userspace as Javi was also mentioning)?
On Wednesday 10 Oct 2018 at 15:27:57 (+0200), Vincent Guittot wrote: > On Wed, 10 Oct 2018 at 15:05, Quentin Perret <quentin.perret@arm.com> wrote: > > > > On Wednesday 10 Oct 2018 at 14:04:40 (+0200), Vincent Guittot wrote: > > > This patchset doesn't touch cpu_capacity_orig and doesn't need to as > > > it assume that the max capacity is unchanged but some capacity is > > > momentary stolen by thermal. > > > If you want to reflect immediately all thermal capping change, you > > > have to update this field and all related fields and struct around > > > > I don't follow you here. I never said I wanted to change > > cpu_capacity_orig. I don't think we should do that actually. Changing > > capacity_of (which is updated during LB IIRC) is just fine. The question > > is about what you want to do there: reflect an averaged value or the > > instantaneous one. > > Sorry I though your were speaking about updating this cpu_capacity_orig. N/p, communication via email can easily become confusing :-) > With using instantaneous max value in capacity_of(), we are back to > the problem raised by Thara that the value will most probably not > reflect the current capping value when it is used in LB, because LB > period can quite long on busy CPU (default max value is 32*sd_weight > ms) But averaging the capping value over time doesn't make LB happen more often ... That will help you account for capping that happened in the past, but it's not obvious this is actually a good thing. Probably not all the time anyway. Say a CPU was capped at 50% of it's capacity, then the cap is removed. At that point it'll take 100ms+ for the thermal signal to decay and let the scheduler know about the newly available capacity. That can probably be a performance hit in some use cases ... And the other way around, it can also take forever for the scheduler to notice that a CPU has a reduced capacity before reacting to it. If you want to filter out very short transient capping events to avoid over-reacting in the scheduler (is this actually happening ?), then maybe the average should be done on the cooling device side or something like that ? Thanks, Quentin
Hello Javi, Thanks for the interest. On 10/10/2018 01:44 AM, Javi Merino wrote: > On Tue, Oct 09, 2018 at 12:24:55PM -0400, Thara Gopinath wrote: >> Thermal governors can respond to an overheat event for a cpu by >> capping the cpu's maximum possible frequency. This in turn >> means that the maximum available compute capacity of the >> cpu is restricted. But today in linux kernel, in event of maximum >> frequency capping of a cpu, the maximum available compute >> capacity of the cpu is not adjusted at all. In other words, scheduler >> is unware maximum cpu capacity restrictions placed due to thermal >> activity. > > Interesting, I would have sworn that I tested this years ago by > lowering the maximum frequency of a cpufreq domain, and the scheduler > reacted accordingly to the new maximum capacities of the cpus. > >> This patch series attempts to address this issue. >> The benefits identified are better task placement among available >> cpus in event of overheating which in turn leads to better >> performance numbers. >> >> The delta between the maximum possible capacity of a cpu and >> maximum available capacity of a cpu due to thermal event can >> be considered as thermal pressure. Instantaneous thermal pressure >> is hard to record and can sometime be erroneous as there can be mismatch >> between the actual capping of capacity and scheduler recording it. >> Thus solution is to have a weighted average per cpu value for thermal >> pressure over time. The weight reflects the amount of time the cpu has >> spent at a capped maximum frequency. To accumulate, average and >> appropriately decay thermal pressure, this patch series uses pelt >> signals and reuses the available framework that does a similar >> bookkeeping of rt/dl task utilization. >> >> Regarding testing, basic build, boot and sanity testing have been >> performed on hikey960 mainline kernel with debian file system. >> Further aobench (An occlusion renderer for benchmarking realworld >> floating point performance) showed the following results on hikey960 >> with debain. >> >> Result Standard Standard >> (Time secs) Error Deviation >> Hikey 960 - no thermal pressure applied 138.67 6.52 11.52% >> Hikey 960 - thermal pressure applied 122.37 5.78 11.57% >> >> Thara Gopinath (7): >> sched/pelt: Add option to make load and util calculations frequency >> invariant >> sched/pelt.c: Add support to track thermal pressure >> sched: Add infrastructure to store and update instantaneous thermal >> pressure >> sched: Initialize per cpu thermal pressure structure >> sched/fair: Enable CFS periodic tick to update thermal pressure >> sched/fair: update cpu_capcity to reflect thermal pressure >> thermal/cpu-cooling: Update thermal pressure in case of a maximum >> frequency capping >> >> drivers/base/arch_topology.c | 1 + >> drivers/thermal/cpu_cooling.c | 20 ++++++++++++- > > thermal? There are other ways in which the maximum frequency of a cpu > can be limited, for example from userspace via scaling_max_freq. Yes there are other ways in which maximum frequency of a cpu can be capped. The difference probably is that in case of a user-space cap, the time duration the cpu remains in the capped state is significantly higher than capping due to a thermal event. So may be the response of the scheduler should be different in that scenario (like rebuilding the capacities of all cpus etc). This patch series does not rebuild the scheduler structures. This just tells the scheduler that some cpu capacity is stolen. > > When something (anything) changes the maximum frequency of a cpufreq > policy, the scheduler should be notified. I think this change should > be done in cpufreq instead to make it generic and not particular to > a given maximum frequency "capper". I am glad to hear you say this! So far, all I have heard whenever I bring up this topic is issues with such an update from cpufreq(delays, lost updates etc). Personally, I have not seen these issues enough to comment on them. I will really like to hear more about these issues for an update from cpufreq here on the list. For me, the patch series is a mechanism to let scheduler know that a thermal event has stolen some cpu capacity. The update itself can happen from any framework which can track the event and we all mutually agree on. Regards Thara > > Cheers, > Javi > -- Regards Thara
On Wed, 10 Oct 2018 at 15:48, Quentin Perret <quentin.perret@arm.com> wrote: > > On Wednesday 10 Oct 2018 at 15:27:57 (+0200), Vincent Guittot wrote: > > On Wed, 10 Oct 2018 at 15:05, Quentin Perret <quentin.perret@arm.com> wrote: > > > > > > On Wednesday 10 Oct 2018 at 14:04:40 (+0200), Vincent Guittot wrote: > > > > This patchset doesn't touch cpu_capacity_orig and doesn't need to as > > > > it assume that the max capacity is unchanged but some capacity is > > > > momentary stolen by thermal. > > > > If you want to reflect immediately all thermal capping change, you > > > > have to update this field and all related fields and struct around > > > > > > I don't follow you here. I never said I wanted to change > > > cpu_capacity_orig. I don't think we should do that actually. Changing > > > capacity_of (which is updated during LB IIRC) is just fine. The question > > > is about what you want to do there: reflect an averaged value or the > > > instantaneous one. > > > > Sorry I though your were speaking about updating this cpu_capacity_orig. > > N/p, communication via email can easily become confusing :-) > > > With using instantaneous max value in capacity_of(), we are back to > > the problem raised by Thara that the value will most probably not > > reflect the current capping value when it is used in LB, because LB > > period can quite long on busy CPU (default max value is 32*sd_weight > > ms) > > But averaging the capping value over time doesn't make LB happen more > often ... That will help you account for capping that happened in the But you know what happens in average between 2 LB > past, but it's not obvious this is actually a good thing. Probably not > all the time anyway. > > Say a CPU was capped at 50% of it's capacity, then the cap is removed. > At that point it'll take 100ms+ for the thermal signal to decay and let > the scheduler know about the newly available capacity. That can probably But the point is that you don't know: - if the capping will not happen soon. If the pressure has reached the 50%, it means that it already happened quite often in the past 100ms. - if there is really available capacity as the current sum of utilization reflects what was available for tasks and not what the tasks really wants to use > be a performance hit in some use cases ... And the other way around, it > can also take forever for the scheduler to notice that a CPU has a What do you mean by forever ? > reduced capacity before reacting to it. > > If you want to filter out very short transient capping events to avoid > over-reacting in the scheduler (is this actually happening ?), then > maybe the average should be done on the cooling device side or something > like that ? > > Thanks, > Quentin
Hi Thara, I have run it on Exynos5433 mainline. When it is enabled with step_wise thermal governor, some of my tests are showing ~30-50% regression (i.e. hackbench), dhrystone ~10%. Could you tell me which thermal governor was used in your case? Please also share the name of that benchmark, i will give it a try. Is it single threaded compute-intensive? Regards, Lukasz On 10/09/2018 06:24 PM, Thara Gopinath wrote: > Thermal governors can respond to an overheat event for a cpu by > capping the cpu's maximum possible frequency. This in turn > means that the maximum available compute capacity of the > cpu is restricted. But today in linux kernel, in event of maximum > frequency capping of a cpu, the maximum available compute > capacity of the cpu is not adjusted at all. In other words, scheduler > is unware maximum cpu capacity restrictions placed due to thermal > activity. This patch series attempts to address this issue. > The benefits identified are better task placement among available > cpus in event of overheating which in turn leads to better > performance numbers. > > The delta between the maximum possible capacity of a cpu and > maximum available capacity of a cpu due to thermal event can > be considered as thermal pressure. Instantaneous thermal pressure > is hard to record and can sometime be erroneous as there can be mismatch > between the actual capping of capacity and scheduler recording it. > Thus solution is to have a weighted average per cpu value for thermal > pressure over time. The weight reflects the amount of time the cpu has > spent at a capped maximum frequency. To accumulate, average and > appropriately decay thermal pressure, this patch series uses pelt > signals and reuses the available framework that does a similar > bookkeeping of rt/dl task utilization. > > Regarding testing, basic build, boot and sanity testing have been > performed on hikey960 mainline kernel with debian file system. > Further aobench (An occlusion renderer for benchmarking realworld > floating point performance) showed the following results on hikey960 > with debain. > > Result Standard Standard > (Time secs) Error Deviation > Hikey 960 - no thermal pressure applied 138.67 6.52 11.52% > Hikey 960 - thermal pressure applied 122.37 5.78 11.57% > > Thara Gopinath (7): > sched/pelt: Add option to make load and util calculations frequency > invariant > sched/pelt.c: Add support to track thermal pressure > sched: Add infrastructure to store and update instantaneous thermal > pressure > sched: Initialize per cpu thermal pressure structure > sched/fair: Enable CFS periodic tick to update thermal pressure > sched/fair: update cpu_capcity to reflect thermal pressure > thermal/cpu-cooling: Update thermal pressure in case of a maximum > frequency capping > > drivers/base/arch_topology.c | 1 + > drivers/thermal/cpu_cooling.c | 20 ++++++++++++- > include/linux/sched.h | 14 +++++++++ > kernel/sched/Makefile | 2 +- > kernel/sched/core.c | 2 ++ > kernel/sched/fair.c | 4 +++ > kernel/sched/pelt.c | 40 ++++++++++++++++++-------- > kernel/sched/pelt.h | 7 +++++ > kernel/sched/sched.h | 1 + > kernel/sched/thermal.c | 66 +++++++++++++++++++++++++++++++++++++++++++ > kernel/sched/thermal.h | 13 +++++++++ > 11 files changed, 157 insertions(+), 13 deletions(-) > create mode 100644 kernel/sched/thermal.c > create mode 100644 kernel/sched/thermal.h >
Hello Ingo, Thank you for the review. On 10/10/2018 02:17 AM, Ingo Molnar wrote: > > * Thara Gopinath <thara.gopinath@linaro.org> wrote: > >> Thermal governors can respond to an overheat event for a cpu by >> capping the cpu's maximum possible frequency. This in turn >> means that the maximum available compute capacity of the >> cpu is restricted. But today in linux kernel, in event of maximum >> frequency capping of a cpu, the maximum available compute >> capacity of the cpu is not adjusted at all. In other words, scheduler >> is unware maximum cpu capacity restrictions placed due to thermal >> activity. This patch series attempts to address this issue. >> The benefits identif ied are better task placement among available >> cpus in event of overheating which in turn leads to better >> performance numbers. >> >> The delta between the maximum possible capacity of a cpu and >> maximum available capacity of a cpu due to thermal event can >> be considered as thermal pressure. Instantaneous thermal pressure >> is hard to record and can sometime be erroneous as there can be mismatch >> between the actual capping of capacity and scheduler recording it. >> Thus solution is to have a weighted average per cpu value for thermal >> pressure over time. The weight reflects the amount of time the cpu has >> spent at a capped maximum frequency. To accumulate, average and >> appropriately decay thermal pressure, this patch series uses pelt >> signals and reuses the available framework that does a similar >> bookkeeping of rt/dl task utilization. >> >> Regarding testing, basic build, boot and sanity testing have been >> performed on hikey960 mainline kernel with debian file system. >> Further aobench (An occlusion renderer for benchmarking realworld >> floating point performance) showed the following results on hikey960 >> with debain. >> >> Result Standard Standard >> (Time secs) Error Deviation >> Hikey 960 - no thermal pressure applied 138.67 6.52 11.52% >> Hikey 960 - thermal pressure applied 122.37 5.78 11.57% > > Wow, +13% speedup, impressive! We definitely want this outcome. > > I'm wondering what happens if we do not track and decay the thermal load at all at the PELT > level, but instantaneously decrease/increase effective CPU capacity in reaction to thermal > events we receive from the CPU. The problem with instantaneous update is that sometimes thermal events happen at a much faster pace than cpu_capacity is updated in the scheduler. This means that at the moment when scheduler uses the value, it might not be correct anymore. Having said that, today Android common kernel has a solution which instantaneously updates cpu_capacity in case of a thermal event. To give a bit of background on the evolution of the solution I have proposed, below is a time line of analysis I have done. 1. I started this activity by analyzing the existing framework on android common kernel. I ran android benchmark tests (Jankbench, Vellamo, Geekbench) with and without the existing instantaneous update mechanism. I found that there is no real performance difference to be observed with an instantaneous updated of cpu_capacity at least in my test scenarios. 2. Then I developed an algorithm to track, accumulate and decay the capacity capping i.e an algorithm without using the pelt signals(this was prior to the new pelt framework in mainline). With this android benchmarks showed performance improvements. At this point I also ported the solution to mainline kernel and ran the aobench analysis which again showed a performance improvement. 3. Finally with the new pelt framework in place, I replaced my algorithm with the one used for rt and dl utilization tracking which is the current patch series. I have not been able to run tests with this on Android yet. All tests were performed on hikey960. I have a Google spreadsheet, documenting results at various stages of analysis. I am not sure how to share it with the group here. > > You describe the averaging as: > >> Instantaneous thermal pressure is hard to record and can sometime be erroneous as there can >> be mismatch between the actual capping of capacity and scheduler recording it. > > Not sure I follow the argument here: are there bogus thermal throttling events? If so then > they are hopefully not frequent enough and should average out over time even if we follow > it instantly. No bogus events. It is more like sometimes capping happens at a much faster rate than cpu_capacity is updated and the scheduler looses these events. > > I.e. what is 'can sometimes be erroneous', exactly? > > Thanks, > > Ingo > -- Regards Thara
Hi guys, On 10/10/18 14:47, Quentin Perret wrote: > On Wednesday 10 Oct 2018 at 15:27:57 (+0200), Vincent Guittot wrote: >> On Wed, 10 Oct 2018 at 15:05, Quentin Perret <quentin.perret@arm.com> wrote: >>> >>> On Wednesday 10 Oct 2018 at 14:04:40 (+0200), Vincent Guittot wrote: >>>> This patchset doesn't touch cpu_capacity_orig and doesn't need to as >>>> it assume that the max capacity is unchanged but some capacity is >>>> momentary stolen by thermal. >>>> If you want to reflect immediately all thermal capping change, you >>>> have to update this field and all related fields and struct around >>> >>> I don't follow you here. I never said I wanted to change >>> cpu_capacity_orig. I don't think we should do that actually. Changing >>> capacity_of (which is updated during LB IIRC) is just fine. The question >>> is about what you want to do there: reflect an averaged value or the >>> instantaneous one. >> >> Sorry I though your were speaking about updating this cpu_capacity_orig. > > N/p, communication via email can easily become confusing :-) > >> With using instantaneous max value in capacity_of(), we are back to >> the problem raised by Thara that the value will most probably not >> reflect the current capping value when it is used in LB, because LB >> period can quite long on busy CPU (default max value is 32*sd_weight >> ms) > > But averaging the capping value over time doesn't make LB happen more > often ... That will help you account for capping that happened in the > past, but it's not obvious this is actually a good thing. Probably not > all the time anyway. > > Say a CPU was capped at 50% of it's capacity, then the cap is removed. > At that point it'll take 100ms+ for the thermal signal to decay and let > the scheduler know about the newly available capacity. That can probably > be a performance hit in some use cases ... And the other way around, it > can also take forever for the scheduler to notice that a CPU has a > reduced capacity before reacting to it. > > If you want to filter out very short transient capping events to avoid > over-reacting in the scheduler (is this actually happening ?), then > maybe the average should be done on the cooling device side or something > like that ? > I think there isn't just the issue of the *occasional* overreaction of a thermal governor due to noise in the temperature sensors or some spike in environmental temperature that determines a delayed reaction in the scheduler due to when capacity is updated. I'm seeing a bigger issue for *sustained* high temperatures that are not treated effectively by governors. Depending on the platform, heat can be dissipated over longer or shorter periods of time. This can determine a seesaw effect on the maximum frequency: it determines the temperature is over a threshold and it starts capping, but heat is not dissipated quickly enough for that to reflect in the value of the temperature sensor, so it continues to cap; when the temperature gets to normal, capping is lifted, which in turn results access to higher OPPs and a return to high temperatures, etc. What will happen is that, *depending on platform* and the moment when capacity is updated, you can see either a CPU with a capacity of 1024, or let's say 800, or (on hikey960 :)) around 500, and back and forth between them. Because of these I tend to think that a regulated (averaged) value of thermal pressure is better than an instantaneous one. Thermal mitigation measures are there for the well-being and safety of a device, not for optimizations so it can and should be allowed to overreact, or have a delayed reaction. But ping-pong-ing tasks back and forth between CPUs due to changes in CPU capacity is harmful for performance. What would be awesome to achieve with this is (close to) optimal use of restricted capacities of CPUs, and I tend to believe instantaneous and most probably out of date capacity values would not lead to this. But this is almost a gut feeling and of course it should be validated on devices with different thermal characteristics. Given the high variation between devices with regards to this I'd be reluctant to tie it to the PELT half life. Regards, Ionela. > Thanks, > Quentin >
On 10/10/2018 17:35, Lukasz Luba wrote: > Hi Thara, > > I have run it on Exynos5433 mainline. > When it is enabled with step_wise thermal governor, > some of my tests are showing ~30-50% regression (i.e. hackbench), > dhrystone ~10%. > > Could you tell me which thermal governor was used in your case? > Please also share the name of that benchmark, i will give it a try. > Is it single threaded compute-intensive? aobench AFAICT It would be interesting if you can share the thermal profile of your board. > On 10/09/2018 06:24 PM, Thara Gopinath wrote: >> Thermal governors can respond to an overheat event for a cpu by >> capping the cpu's maximum possible frequency. This in turn >> means that the maximum available compute capacity of the >> cpu is restricted. But today in linux kernel, in event of maximum >> frequency capping of a cpu, the maximum available compute >> capacity of the cpu is not adjusted at all. In other words, scheduler >> is unware maximum cpu capacity restrictions placed due to thermal >> activity. This patch series attempts to address this issue. >> The benefits identified are better task placement among available >> cpus in event of overheating which in turn leads to better >> performance numbers. >> >> The delta between the maximum possible capacity of a cpu and >> maximum available capacity of a cpu due to thermal event can >> be considered as thermal pressure. Instantaneous thermal pressure >> is hard to record and can sometime be erroneous as there can be mismatch >> between the actual capping of capacity and scheduler recording it. >> Thus solution is to have a weighted average per cpu value for thermal >> pressure over time. The weight reflects the amount of time the cpu has >> spent at a capped maximum frequency. To accumulate, average and >> appropriately decay thermal pressure, this patch series uses pelt >> signals and reuses the available framework that does a similar >> bookkeeping of rt/dl task utilization. >> >> Regarding testing, basic build, boot and sanity testing have been >> performed on hikey960 mainline kernel with debian file system. >> Further aobench (An occlusion renderer for benchmarking realworld >> floating point performance) showed the following results on hikey960 >> with debain. >> >> Result Standard Standard >> (Time secs) Error Deviation >> Hikey 960 - no thermal pressure applied 138.67 6.52 11.52% >> Hikey 960 - thermal pressure applied 122.37 5.78 11.57% >> >> Thara Gopinath (7): >> sched/pelt: Add option to make load and util calculations frequency >> invariant >> sched/pelt.c: Add support to track thermal pressure >> sched: Add infrastructure to store and update instantaneous thermal >> pressure >> sched: Initialize per cpu thermal pressure structure >> sched/fair: Enable CFS periodic tick to update thermal pressure >> sched/fair: update cpu_capcity to reflect thermal pressure >> thermal/cpu-cooling: Update thermal pressure in case of a maximum >> frequency capping >> >> drivers/base/arch_topology.c | 1 + >> drivers/thermal/cpu_cooling.c | 20 ++++++++++++- >> include/linux/sched.h | 14 +++++++++ >> kernel/sched/Makefile | 2 +- >> kernel/sched/core.c | 2 ++ >> kernel/sched/fair.c | 4 +++ >> kernel/sched/pelt.c | 40 ++++++++++++++++++-------- >> kernel/sched/pelt.h | 7 +++++ >> kernel/sched/sched.h | 1 + >> kernel/sched/thermal.c | 66 +++++++++++++++++++++++++++++++++++++++++++ >> kernel/sched/thermal.h | 13 +++++++++ >> 11 files changed, 157 insertions(+), 13 deletions(-) >> create mode 100644 kernel/sched/thermal.c >> create mode 100644 kernel/sched/thermal.h >> -- <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook | <http://twitter.com/#!/linaroorg> Twitter | <http://www.linaro.org/linaro-blog/> Blog
Hello Quentin, On 10/10/2018 05:55 AM, Quentin Perret wrote: > Hi Vincent, > > On Wednesday 10 Oct 2018 at 10:50:05 (+0200), Vincent Guittot wrote: >> The problem with reflecting directly the capping is that it happens >> far more often than the pace at which cpu_capacity_orig is updated in >> the scheduler. > > Hmm, how can you be so sure ? That most likely depends on the workload, > the platform and the thermal governor. Some platforms heat up slowly, > some quickly. The pace at which the thermal governor will change things > should depend on that I assume. Yes I agree. How often a thermal event occurs is indeed dependent on workload, platform and governors. For e.g. hikey960 the same workload displays different behavior with and without a fan. What we want is a generic solution that will work across "spiky" events and more spread out events. An instantaneous update of cpu_capacity most certainly does not capture spiky events. Also it can lead to cpu_capacity swinging wildly from the original value to a much lower value and back again. Averaging of thermal capacity limitation (Across any pre-determined duration i.e using Pelt signals or without) takes care of smoothening the effect of thermal capping and making sure that capping events are not entirely missed. For me it is a much more elegant solution than an instantaneous solution. Having said that, like you mentioned below, when the thermal capping goes away, it is not reflected as an instantaneous availability of spare capacity. It is slowly increased. But it is not necessarily wrong information. All it tells the scheduler is that during the last "pre-determined" duration on an average "X" was the capacity available for a CFS task on this cpu. > >> This means that at the moment when scheduler uses the >> value, it might not be correct anymore. > > And OTOH, when you remove a cap for example, it will take time before > the scheduler can see the newly available capacity if you need to wait > for the signal to decay. So you are using a wrong information too in > that scenario. > >> Then, this value are also used >> when building the sched_domain and setting max_cpu_capacity which >> would also implies the rebuilt the sched_domain topology ... > > Wait what ? I thought the thermal cap was reflected in capacity_of, not > capacity_orig_of ... You need to rebuild the sched_domain in case of > thermal pressure ? > > Hmm, let me have a closer look at the patches, I must have missed > something ... > >> The pace of changing the capping is to fast to reflect that in the >> whole scheduler topology > > That's probably true in some cases, but it'd be cool to have numbers to > back up that statement, I think. > > Now, if you do need to rebuild the sched domain topology every time you > update the thermal pressure, I think the PELT HL is _way_ too short for > that ... You can't rebuild the whole thing every 32ms or so. Or am I > misunderstanding something ? > >>> Thara, have you tried to experiment with a simpler implementation as >>> suggested by Ingo ? >>> >>> Also, assuming that we do want to average things, do we actually want to >>> tie the thermal ramp-up time to the PELT half life ? That provides >>> nice maths properties wrt the other signals, but it's not obvious to me >>> that this thermal 'constant' should be the same on all platforms. Or >>> maybe it should ? >> >> The main interest of using PELT signal is that thermal pressure will >> evolve at the same pace as other signals used in the scheduler. > > Right, I think this is a nice property too (assuming that we actually > want to average things out). > >> With >> thermal pressure, we have the exact same problem as with RT tasks. The >> thermal will cap the max frequency which will cap the utilization of >> the tasks running on the CPU > > Well, the nature of the signal is slightly different IMO. Yes it's > capacity, but you're no actually measuring time spent on the CPU. All > other PELT signals are based on time, this thermal thing isn't, so it is > kinda different in a way. And I'm still wondering if it could be helpful > to be able to have a different HL for that thermal signal. That would > 'break' the nice maths properties we have, yes, but is it a problem or is > it actually helpful to cope with the thermal characteristics of > different platforms ? > > Thanks, > Quentin > -- Regards Thara
Hi Daniel, On 10/10/2018 06:54 PM, Daniel Lezcano wrote: > On 10/10/2018 17:35, Lukasz Luba wrote: >> Hi Thara, >> >> I have run it on Exynos5433 mainline. >> When it is enabled with step_wise thermal governor, >> some of my tests are showing ~30-50% regression (i.e. hackbench), >> dhrystone ~10%. >> >> Could you tell me which thermal governor was used in your case? >> Please also share the name of that benchmark, i will give it a try. >> Is it single threaded compute-intensive? > > aobench AFAICT > > It would be interesting if you can share the thermal profile of your board. > Thanks for the benchmark name. It was tested on Samsung TM2 device with Exynos 5433 with debian. Thermal stuff you can find in mainline: arch/arm64/boot/dts/exynos/exynos5433-tmu.dtsi Regards, Lukasz > >> On 10/09/2018 06:24 PM, Thara Gopinath wrote: >>> Thermal governors can respond to an overheat event for a cpu by >>> capping the cpu's maximum possible frequency. This in turn >>> means that the maximum available compute capacity of the >>> cpu is restricted. But today in linux kernel, in event of maximum >>> frequency capping of a cpu, the maximum available compute >>> capacity of the cpu is not adjusted at all. In other words, scheduler >>> is unware maximum cpu capacity restrictions placed due to thermal >>> activity. This patch series attempts to address this issue. >>> The benefits identified are better task placement among available >>> cpus in event of overheating which in turn leads to better >>> performance numbers. >>> >>> The delta between the maximum possible capacity of a cpu and >>> maximum available capacity of a cpu due to thermal event can >>> be considered as thermal pressure. Instantaneous thermal pressure >>> is hard to record and can sometime be erroneous as there can be mismatch >>> between the actual capping of capacity and scheduler recording it. >>> Thus solution is to have a weighted average per cpu value for thermal >>> pressure over time. The weight reflects the amount of time the cpu has >>> spent at a capped maximum frequency. To accumulate, average and >>> appropriately decay thermal pressure, this patch series uses pelt >>> signals and reuses the available framework that does a similar >>> bookkeeping of rt/dl task utilization. >>> >>> Regarding testing, basic build, boot and sanity testing have been >>> performed on hikey960 mainline kernel with debian file system. >>> Further aobench (An occlusion renderer for benchmarking realworld >>> floating point performance) showed the following results on hikey960 >>> with debain. >>> >>> Result Standard Standard >>> (Time secs) Error Deviation >>> Hikey 960 - no thermal pressure applied 138.67 6.52 11.52% >>> Hikey 960 - thermal pressure applied 122.37 5.78 11.57% >>> >>> Thara Gopinath (7): >>> sched/pelt: Add option to make load and util calculations frequency >>> invariant >>> sched/pelt.c: Add support to track thermal pressure >>> sched: Add infrastructure to store and update instantaneous thermal >>> pressure >>> sched: Initialize per cpu thermal pressure structure >>> sched/fair: Enable CFS periodic tick to update thermal pressure >>> sched/fair: update cpu_capcity to reflect thermal pressure >>> thermal/cpu-cooling: Update thermal pressure in case of a maximum >>> frequency capping >>> >>> drivers/base/arch_topology.c | 1 + >>> drivers/thermal/cpu_cooling.c | 20 ++++++++++++- >>> include/linux/sched.h | 14 +++++++++ >>> kernel/sched/Makefile | 2 +- >>> kernel/sched/core.c | 2 ++ >>> kernel/sched/fair.c | 4 +++ >>> kernel/sched/pelt.c | 40 ++++++++++++++++++-------- >>> kernel/sched/pelt.h | 7 +++++ >>> kernel/sched/sched.h | 1 + >>> kernel/sched/thermal.c | 66 +++++++++++++++++++++++++++++++++++++++++++ >>> kernel/sched/thermal.h | 13 +++++++++ >>> 11 files changed, 157 insertions(+), 13 deletions(-) >>> create mode 100644 kernel/sched/thermal.c >>> create mode 100644 kernel/sched/thermal.h >>> > >
On 10/10/2018 07:30 PM, Thara Gopinath wrote: > Hello Lukasz, > > On 10/10/2018 11:35 AM, Lukasz Luba wrote: >> Hi Thara, >> >> I have run it on Exynos5433 mainline. >> When it is enabled with step_wise thermal governor, >> some of my tests are showing ~30-50% regression (i.e. hackbench), >> dhrystone ~10%. > > That is interesting. If I understand correctly, dhrystone spawns 16 > threads or so and floods the system. In "theory", such a test should not > see any performance improvement and degradation. What is the thermal > activity like in your system? I will try running one of these tests on > hikey960. I use this dhrystone implementation: https://github.com/Keith-S-Thompson/dhrystone/blob/master/v2.2/dry.c It does not span new threads/processes and I pinned it to a single cpu. My thermal setup is probably different than yours. You have (on hikey960) probably 1 sensor for whole SoC and one thermal zone (if it is this mainline file: arch/arm64/boot/dts/hisilicon/hi3660.dtsi). This thermal zone has two cooling devices - two clusters with dvfs. Your temperature signal read out from that sensor is probably much smoother. When you have sensor inside cluster, the rising factor can be even 20deg/s (for big cores). In my case, there are 4 thermal zones, each cluster has it's private sensor and thermal zone. There is no 'SoC sensor' or 'PCB sensor', which is recommended for IPA. >> >> Could you tell me which thermal governor was used in your case? >> Please also share the name of that benchmark, i will give it a try. >> Is it single threaded compute-intensive? > > Step-wise governor. > I use aobench which is part of phoronix-test-suite. > > Regards > Thara > I have built this aobench and run it pinned to single big cpu: time taskset -c 4 ./aobench The results: 3min-5:30min [mainline] 5:15min-5:50min [+patchset] The idea is definitely worth to investigate further. Regards, Lukasz
On 10/11/2018 10:23 AM, Daniel Lezcano wrote: > On 11/10/2018 09:35, Lukasz Luba wrote: >> Hi Daniel, >> >> On 10/10/2018 06:54 PM, Daniel Lezcano wrote: >>> On 10/10/2018 17:35, Lukasz Luba wrote: >>>> Hi Thara, >>>> >>>> I have run it on Exynos5433 mainline. >>>> When it is enabled with step_wise thermal governor, >>>> some of my tests are showing ~30-50% regression (i.e. hackbench), >>>> dhrystone ~10%. >>>> >>>> Could you tell me which thermal governor was used in your case? >>>> Please also share the name of that benchmark, i will give it a try. >>>> Is it single threaded compute-intensive? >>> >>> aobench AFAICT >>> >>> It would be interesting if you can share the thermal profile of your board. >>> >> Thanks for the benchmark name. >> It was tested on Samsung TM2 device with Exynos 5433 with debian. >> Thermal stuff you can find in mainline: >> arch/arm64/boot/dts/exynos/exynos5433-tmu.dtsi > > By thermal profile, I was meaning a figure with the temperature > regulation (temperature idle, then workload, then temperature increase, > mitigated temperature, end of workload, temperature decrease). Currently, I cannot share these data. > > The thermal description looks wrong in the DT. I suggest to experiment > the series with the DT fixed. Could you tell more what looks wrong or maybe send a draft/patch? > eg. from hi6220.dtsi > > > thermal-zones { > > cls0: cls0 { > polling-delay = <1000>; > polling-delay-passive = <100>; > sustainable-power = <3326>; > > /* sensor ID */ > thermal-sensors = <&tsensor 2>; > > trips { > threshold: trip-point@0 { > temperature = <65000>; > hysteresis = <0>; > type = "passive"; > }; > > target: trip-point@1 { > temperature = <75000>; > hysteresis = <0>; > type = "passive"; > }; > }; > > cooling-maps { > map0 { > trip = <&target>; > cooling-device = <&cpu0 > THERMAL_NO_LIMIT THERMAL_NO_LIMIT>; > }; > }; > }; > }; > > Note the cooling devices are *passive* > > For me this DT looks like it is copied from ARM Juno board for IPA, where AFAIR there were no interrupts for the temperature sensor (due to some psci/scpi/firmware lack of support). The cooling map is also short. I am not sure if it would work efficient with step-wise, with IPA it will work. The DT configuration for Exynos has been working OK for years. Exynos5433 supports 8 trip points, Exynos5422 has 4. So if you have more trip points you need to start 'polling'. Therefore, for Exynos there is no need to run queued work in thermal framework every 1s or 100ms just to read the current temperature. The exynos-tmu driver gets irq and calls thermal framework when such trip point is crossed. Then there is 'monitoring window' for the trip point... Long story short: 'active' is used for that reason. Regards, Lukasz
* Thara Gopinath <thara.gopinath@linaro.org> wrote: > >> Regarding testing, basic build, boot and sanity testing have been > >> performed on hikey960 mainline kernel with debian file system. > >> Further aobench (An occlusion renderer for benchmarking realworld > >> floating point performance) showed the following results on hikey960 > >> with debain. > >> > >> Result Standard Standard > >> (Time secs) Error Deviation > >> Hikey 960 - no thermal pressure applied 138.67 6.52 11.52% > >> Hikey 960 - thermal pressure applied 122.37 5.78 11.57% > > > > Wow, +13% speedup, impressive! We definitely want this outcome. > > > > I'm wondering what happens if we do not track and decay the thermal > > load at all at the PELT level, but instantaneously decrease/increase > > effective CPU capacity in reaction to thermal events we receive from > > the CPU. > > The problem with instantaneous update is that sometimes thermal events > happen at a much faster pace than cpu_capacity is updated in the > scheduler. This means that at the moment when scheduler uses the > value, it might not be correct anymore. Let me offer a different interpretation: if we average throttling events then we create a 'smooth' average of 'true CPU capacity' that doesn't fluctuate much. This allows more stable yet asymmetric task placement if the thermal characteristics of the different cores is different (asymmetric). This, compared to instantaneous updates, would reduce unnecessary task migrations between cores. Is that accurate? If the thermal characteristics of the cores is roughly symmetric and the measured CPU-intense load itself is symmetric as well, then I have trouble seeing why reacting to thermal events should make any difference at all. Are there any inherent asymmetries in the thermal properties of the cores, or in the benchmarked workload itself? Thanks, Ingo
Hi Lukasz, On Thu, 11 Oct 2018 at 13:10, Lukasz Luba <l.luba@partner.samsung.com> wrote: > > > > On 10/10/2018 07:30 PM, Thara Gopinath wrote: > > Hello Lukasz, > > > > On 10/10/2018 11:35 AM, Lukasz Luba wrote: > >> Hi Thara, > >> > >> I have run it on Exynos5433 mainline. > >> When it is enabled with step_wise thermal governor, > >> some of my tests are showing ~30-50% regression (i.e. hackbench), > >> dhrystone ~10%. > > > > That is interesting. If I understand correctly, dhrystone spawns 16 > > threads or so and floods the system. In "theory", such a test should not > > see any performance improvement and degradation. What is the thermal > > activity like in your system? I will try running one of these tests on > > hikey960. > I use this dhrystone implementation: > https://github.com/Keith-S-Thompson/dhrystone/blob/master/v2.2/dry.c > It does not span new threads/processes and I pinned it to a single cpu. > > My thermal setup is probably different than yours. > You have (on hikey960) probably 1 sensor for whole SoC and one thermal > zone (if it is this mainline file: > arch/arm64/boot/dts/hisilicon/hi3660.dtsi). > This thermal zone has two cooling devices - two clusters with dvfs. > Your temperature signal read out from that sensor is probably much > smoother. When you have sensor inside cluster, the rising factor > can be even 20deg/s (for big cores). > In my case, there are 4 thermal zones, each cluster has it's private > sensor and thermal zone. There is no 'SoC sensor' or 'PCB sensor', > which is recommended for IPA. > >> > >> Could you tell me which thermal governor was used in your case? > >> Please also share the name of that benchmark, i will give it a try. > >> Is it single threaded compute-intensive? > > > > Step-wise governor. > > I use aobench which is part of phoronix-test-suite. > > > > Regards > > Thara > > > I have built this aobench and run it pinned to single big cpu: > time taskset -c 4 ./aobench Why have you pinned the test only on CPU4 ? Goal of thermal pressure is to inform the scheduler of reduced compute capacity and help the scheduler to take better decision in tasks placement. So I would not expect perf impact on your test as the bench will stay pinned on the cpu4 That being said you obviously have perf impact as shown below in your results And other tasks on the system are not pinned and might come and disturb your bench > The results: > 3min-5:30min [mainline] > 5:15min-5:50min [+patchset] > > The idea is definitely worth to investigate further. Yes I agree Vincent > > Regards, > Lukasz > > >
On 10/16/2018 03:33 AM, Ingo Molnar wrote: > > * Thara Gopinath <thara.gopinath@linaro.org> wrote: > >>>> Regarding testing, basic build, boot and sanity testing have been >>>> performed on hikey960 mainline kernel with debian file system. >>>> Further aobench (An occlusion renderer for benchmarking realworld >>>> floating point performance) showed the following results on hikey960 >>>> with debain. >>>> >>>> Result Standard Standard >>>> (Time secs) Error Deviation >>>> Hikey 960 - no thermal pressure applied 138.67 6.52 11.52% >>>> Hikey 960 - thermal pressure applied 122.37 5.78 11.57% >>> >>> Wow, +13% speedup, impressive! We definitely want this outcome. >>> >>> I'm wondering what happens if we do not track and decay the thermal >>> load at all at the PELT level, but instantaneously decrease/increase >>> effective CPU capacity in reaction to thermal events we receive from >>> the CPU. >> >> The problem with instantaneous update is that sometimes thermal events >> happen at a much faster pace than cpu_capacity is updated in the >> scheduler. This means that at the moment when scheduler uses the >> value, it might not be correct anymore. > > Let me offer a different interpretation: if we average throttling events > then we create a 'smooth' average of 'true CPU capacity' that doesn't > fluctuate much. This allows more stable yet asymmetric task placement if > the thermal characteristics of the different cores is different > (asymmetric). This, compared to instantaneous updates, would reduce > unnecessary task migrations between cores. > > Is that accurate? Yes. I think it is accurate. I will also add that if we don't average throttling events, we will miss the events that occur in between load balancing(LB) period. > > If the thermal characteristics of the cores is roughly symmetric and the > measured CPU-intense load itself is symmetric as well, then I have > trouble seeing why reacting to thermal events should make any difference > at all. In this scenario, i agree that scheduler reaction to thermal events should not make any difference in fact we should not observe any improvement or degradation in performance. > > Are there any inherent asymmetries in the thermal properties of the > cores, or in the benchmarked workload itself? The benchmarked workload , meaning aobench? I don't think there arre any asymmetries there. On Hikey960, there are two clusters with different frequency domains. So yes I will say there is asymmetry there. Asides, IMHO, any other tasks running on the system can create an inherent asymmetry as cpu utilizations can vary. Regards Thara > > Thanks, > > Ingo > -- Regards Thara
On 10/16/2018 01:11 PM, Vincent Guittot wrote: > Hi Lukasz, > > On Thu, 11 Oct 2018 at 13:10, Lukasz Luba <l.luba@partner.samsung.com> wrote: >> >> >> >> On 10/10/2018 07:30 PM, Thara Gopinath wrote: >>> Hello Lukasz, >>> >>> On 10/10/2018 11:35 AM, Lukasz Luba wrote: >>>> Hi Thara, >>>> >>>> I have run it on Exynos5433 mainline. >>>> When it is enabled with step_wise thermal governor, >>>> some of my tests are showing ~30-50% regression (i.e. hackbench), >>>> dhrystone ~10%. >>> >>> That is interesting. If I understand correctly, dhrystone spawns 16 >>> threads or so and floods the system. In "theory", such a test should not >>> see any performance improvement and degradation. What is the thermal >>> activity like in your system? I will try running one of these tests on >>> hikey960. >> I use this dhrystone implementation: >> https://github.com/Keith-S-Thompson/dhrystone/blob/master/v2.2/dry.c >> It does not span new threads/processes and I pinned it to a single cpu. >> >> My thermal setup is probably different than yours. >> You have (on hikey960) probably 1 sensor for whole SoC and one thermal >> zone (if it is this mainline file: >> arch/arm64/boot/dts/hisilicon/hi3660.dtsi). >> This thermal zone has two cooling devices - two clusters with dvfs. >> Your temperature signal read out from that sensor is probably much >> smoother. When you have sensor inside cluster, the rising factor >> can be even 20deg/s (for big cores). >> In my case, there are 4 thermal zones, each cluster has it's private >> sensor and thermal zone. There is no 'SoC sensor' or 'PCB sensor', >> which is recommended for IPA. >>>> >>>> Could you tell me which thermal governor was used in your case? >>>> Please also share the name of that benchmark, i will give it a try. >>>> Is it single threaded compute-intensive? >>> >>> Step-wise governor. >>> I use aobench which is part of phoronix-test-suite. >>> >>> Regards >>> Thara >>> >> I have built this aobench and run it pinned to single big cpu: >> time taskset -c 4 ./aobench > > Why have you pinned the test only on CPU4 ? > Goal of thermal pressure is to inform the scheduler of reduced compute > capacity and help the scheduler to take better decision in tasks > placement. Hi Lukasz, I agree with Vincent's observation. I had not seen this earlier. Pinning a task to a cpu will obviously prevent migration. The performance regression could be due to as Vincent mentioned below other tasks in the system. On another note, which cpufreq governor are you using? Is the core being bumped up to highest possible OPP during the test? Regards Thara > So I would not expect perf impact on your test as the bench will stay > pinned on the cpu4 > That being said you obviously have perf impact as shown below in your results > And other tasks on the system are not pinned and might come and > disturb your bench > >> The results: >> 3min-5:30min [mainline] >> 5:15min-5:50min [+patchset] >> >> The idea is definitely worth to investigate further. > > Yes I agree > > Vincent >> >> Regards, >> Lukasz >> >> >> -- Regards Thara
On Thu, Oct 18, 2018 at 8:48 AM Ingo Molnar <mingo@kernel.org> wrote: > > > * Thara Gopinath <thara.gopinath@linaro.org> wrote: > > > On 10/16/2018 03:33 AM, Ingo Molnar wrote: > > > > > > * Thara Gopinath <thara.gopinath@linaro.org> wrote: > > > > > >>>> Regarding testing, basic build, boot and sanity testing have been > > >>>> performed on hikey960 mainline kernel with debian file system. > > >>>> Further aobench (An occlusion renderer for benchmarking realworld > > >>>> floating point performance) showed the following results on hikey960 > > >>>> with debain. > > >>>> > > >>>> Result Standard Standard > > >>>> (Time secs) Error Deviation > > >>>> Hikey 960 - no thermal pressure applied 138.67 6.52 11.52% > > >>>> Hikey 960 - thermal pressure applied 122.37 5.78 11.57% > > >>> > > >>> Wow, +13% speedup, impressive! We definitely want this outcome. > > >>> > > >>> I'm wondering what happens if we do not track and decay the thermal > > >>> load at all at the PELT level, but instantaneously decrease/increase > > >>> effective CPU capacity in reaction to thermal events we receive from > > >>> the CPU. > > >> > > >> The problem with instantaneous update is that sometimes thermal events > > >> happen at a much faster pace than cpu_capacity is updated in the > > >> scheduler. This means that at the moment when scheduler uses the > > >> value, it might not be correct anymore. > > > > > > Let me offer a different interpretation: if we average throttling events > > > then we create a 'smooth' average of 'true CPU capacity' that doesn't > > > fluctuate much. This allows more stable yet asymmetric task placement if > > > the thermal characteristics of the different cores is different > > > (asymmetric). This, compared to instantaneous updates, would reduce > > > unnecessary task migrations between cores. > > > > > > Is that accurate? > > > > Yes. I think it is accurate. I will also add that if we don't average > > throttling events, we will miss the events that occur in between load > > balancing(LB) period. > > Yeah, so I'd definitely suggest to not integrate this averaging into > pelt.c in the fashion presented, because: > > - This couples your thermal throttling averaging to the PELT decay > half-time AFAICS, which would break the other user every time the > decay is changed/tuned. > > - The boolean flag that changes behavior in pelt.c is not particularly > clean either and complicates the code. > > - Instead maybe factor out a decaying average library into > kernel/sched/avg.h perhaps (if this truly improves the code), and use > those methods both in pelt.c and any future thermal.c - and maybe > other places where we do decaying averages. > > - But simple decaying averages are not that complex either, so I think > your original solution of open coding it is probably fine as well. > > Furthermore, any logic introduced by thermal.c and the resulting change > to load-balancing behavior would have to be in perfect sync with cpufreq > governor actions - one mechanism should not work against the other. Right, that really is required. > The only long term maintainable solution is to move all high level > cpufreq logic and policy handling code into kernel/sched/cpufreq*.c, > which has been done to a fair degree already in the past ~2 years - but > it's unclear to me to what extent this is true for thermal throttling > policy currently: there might be more governor surgery and code > reshuffling required? It doesn't cover thermal management directly ATM. The EAS work kind of hopes to make a connection in there by adding a common energy model to underlie both the performance scaling and thermal management, but it doesn't change the thermal decision making part AFAICS. So it is fair to say that additional governor surgery and code reshuffling will be required IMO. Thanks, Rafael
* Rafael J. Wysocki <rafael@kernel.org> wrote: > > The only long term maintainable solution is to move all high level > > cpufreq logic and policy handling code into kernel/sched/cpufreq*.c, > > which has been done to a fair degree already in the past ~2 years - but > > it's unclear to me to what extent this is true for thermal throttling > > policy currently: there might be more governor surgery and code > > reshuffling required? > > It doesn't cover thermal management directly ATM. > > The EAS work kind of hopes to make a connection in there by adding a > common energy model to underlie both the performance scaling and > thermal management, but it doesn't change the thermal decision making > part AFAICS. > > So it is fair to say that additional governor surgery and code > reshuffling will be required IMO. BTW., when factoring out high level thermal management code it might make sense to increase the prominence of the cpufreq code within the scheduler and organize it a bit better, by introducing its own kernel/sched/cpufreq/ directory and renaming things the following way: kernel/sched/cpufreq.c => kernel/sched/cpufreq/core.c kernel/sched/cpufreq_schedutil.c => kernel/sched/cpufreq/metrics.c kernel/sched/thermal.c => kernel/sched/cpufreq/thermal.c ... or so? With no change to functionality, this is just a re-organization and expansion/preparation for the bright future. =B-) Thanks, Ingo
On 10/17/2018 06:24 PM, Thara Gopinath wrote: > On 10/16/2018 01:11 PM, Vincent Guittot wrote: >> Hi Lukasz, >> >> On Thu, 11 Oct 2018 at 13:10, Lukasz Luba <l.luba@partner.samsung.com> wrote: >>> >>> >>> >>> On 10/10/2018 07:30 PM, Thara Gopinath wrote: >>>> Hello Lukasz, >>>> >>>> On 10/10/2018 11:35 AM, Lukasz Luba wrote: >>>>> Hi Thara, >>>>> >>>>> I have run it on Exynos5433 mainline. >>>>> When it is enabled with step_wise thermal governor, >>>>> some of my tests are showing ~30-50% regression (i.e. hackbench), >>>>> dhrystone ~10%. >>>> >>>> That is interesting. If I understand correctly, dhrystone spawns 16 >>>> threads or so and floods the system. In "theory", such a test should not >>>> see any performance improvement and degradation. What is the thermal >>>> activity like in your system? I will try running one of these tests on >>>> hikey960. >>> I use this dhrystone implementation: >>> https://github.com/Keith-S-Thompson/dhrystone/blob/master/v2.2/dry.c >>> It does not span new threads/processes and I pinned it to a single cpu. >>> >>> My thermal setup is probably different than yours. >>> You have (on hikey960) probably 1 sensor for whole SoC and one thermal >>> zone (if it is this mainline file: >>> arch/arm64/boot/dts/hisilicon/hi3660.dtsi). >>> This thermal zone has two cooling devices - two clusters with dvfs. >>> Your temperature signal read out from that sensor is probably much >>> smoother. When you have sensor inside cluster, the rising factor >>> can be even 20deg/s (for big cores). >>> In my case, there are 4 thermal zones, each cluster has it's private >>> sensor and thermal zone. There is no 'SoC sensor' or 'PCB sensor', >>> which is recommended for IPA. >>>>> >>>>> Could you tell me which thermal governor was used in your case? >>>>> Please also share the name of that benchmark, i will give it a try. >>>>> Is it single threaded compute-intensive? >>>> >>>> Step-wise governor. >>>> I use aobench which is part of phoronix-test-suite. >>>> >>>> Regards >>>> Thara >>>> >>> I have built this aobench and run it pinned to single big cpu: >>> time taskset -c 4 ./aobench >> >> Why have you pinned the test only on CPU4 ? >> Goal of thermal pressure is to inform the scheduler of reduced compute >> capacity and help the scheduler to take better decision in tasks >> placement. > > Hi Lukasz, > > I agree with Vincent's observation. I had not seen this earlier. Pinning > a task to a cpu will obviously prevent migration. The performance > regression could be due to as Vincent mentioned below other tasks in the > system. On another note, which cpufreq governor are you using? Is the > core being bumped up to highest possible OPP during the test? The governor is ondemand. No, it is not at highest OPP. Could you send me the needed test configuration and condition? We would then align in this area. Regards, Lukasz > > Regards > Thara >> So I would not expect perf impact on your test as the bench will stay >> pinned on the cpu4 >> That being said you obviously have perf impact as shown below in your results >> And other tasks on the system are not pinned and might come and >> disturb your bench >> >>> The results: >>> 3min-5:30min [mainline] >>> 5:15min-5:50min [+patchset] >>> >>> The idea is definitely worth to investigate further. >> >> Yes I agree >> >> Vincent >>> >>> Regards, >>> Lukasz >>> >>> >>> > >
On Thu, Oct 18, 2018 at 9:50 AM Ingo Molnar <mingo@kernel.org> wrote: > > > * Rafael J. Wysocki <rafael@kernel.org> wrote: > > > > The only long term maintainable solution is to move all high level > > > cpufreq logic and policy handling code into kernel/sched/cpufreq*.c, > > > which has been done to a fair degree already in the past ~2 years - but > > > it's unclear to me to what extent this is true for thermal throttling > > > policy currently: there might be more governor surgery and code > > > reshuffling required? > > > > It doesn't cover thermal management directly ATM. > > > > The EAS work kind of hopes to make a connection in there by adding a > > common energy model to underlie both the performance scaling and > > thermal management, but it doesn't change the thermal decision making > > part AFAICS. > > > > So it is fair to say that additional governor surgery and code > > reshuffling will be required IMO. > > BTW., when factoring out high level thermal management code it might make > sense to increase the prominence of the cpufreq code within the scheduler > and organize it a bit better, by introducing its own > kernel/sched/cpufreq/ directory and renaming things the following way: > > kernel/sched/cpufreq.c => kernel/sched/cpufreq/core.c > kernel/sched/cpufreq_schedutil.c => kernel/sched/cpufreq/metrics.c > kernel/sched/thermal.c => kernel/sched/cpufreq/thermal.c > > ... or so? > > With no change to functionality, this is just a re-organization and > expansion/preparation for the bright future. =B-) No disagreement here. :-) Cheers, Rafael
On 10/18/2018 02:48 AM, Ingo Molnar wrote: > > * Thara Gopinath <thara.gopinath@linaro.org> wrote: > >> On 10/16/2018 03:33 AM, Ingo Molnar wrote: >>> >>> * Thara Gopinath <thara.gopinath@linaro.org> wrote: >>> >>>>>> Regarding testing, basic build, boot and sanity testing have been >>>>>> performed on hikey960 mainline kernel with debian file system. >>>>>> Further aobench (An occlusion renderer for benchmarking realworld >>>>>> floating point performance) showed the following results on hikey960 >>>>>> with debain. >>>>>> >>>>>> Result Standard Standard >>>>>> (Time secs) Error Deviation >>>>>> Hikey 960 - no thermal pressure applied 138.67 6.52 11.52% >>>>>> Hikey 960 - thermal pressure applied 122.37 5.78 11.57% >>>>> >>>>> Wow, +13% speedup, impressive! We definitely want this outcome. >>>>> >>>>> I'm wondering what happens if we do not track and decay the thermal >>>>> load at all at the PELT level, but instantaneously decrease/increase >>>>> effective CPU capacity in reaction to thermal events we receive from >>>>> the CPU. >>>> >>>> The problem with instantaneous update is that sometimes thermal events >>>> happen at a much faster pace than cpu_capacity is updated in the >>>> scheduler. This means that at the moment when scheduler uses the >>>> value, it might not be correct anymore. >>> >>> Let me offer a different interpretation: if we average throttling events >>> then we create a 'smooth' average of 'true CPU capacity' that doesn't >>> fluctuate much. This allows more stable yet asymmetric task placement if >>> the thermal characteristics of the different cores is different >>> (asymmetric). This, compared to instantaneous updates, would reduce >>> unnecessary task migrations between cores. >>> >>> Is that accurate? >> >> Yes. I think it is accurate. I will also add that if we don't average >> throttling events, we will miss the events that occur in between load >> balancing(LB) period. > > Yeah, so I'd definitely suggest to not integrate this averaging into > pelt.c in the fashion presented, because: > > - This couples your thermal throttling averaging to the PELT decay > half-time AFAICS, which would break the other user every time the > decay is changed/tuned. Let me pose the question in this manner. Today rt utilization, dl utilization etc is tracked via PELT. The inherent idea is that, a cpu has some capacity stolen by let us say rt task and so subtract the capacity utilized by the rt task from cpu when calculating the remaining capacity for a CFS task. Now, the idea behind thermal pressure is that, the maximum available capacity of a cpu is limited due to a thermal event, so take it out of the remaining capacity of a cpu for a CFS task (at-least to start with). If utilization for rt task, dl task etc is calculated via PELT and the capacity constraint due to thermal event calculated by another averaging algorithm, there can be some mismatch in the "capacity stolen" calculations, right? Isnt't it better to track all the events that can limit the capacity of a cpu via one algorithm ? > > - The boolean flag that changes behavior in pelt.c is not particularly > clean either and complicates the code. I agree. Part of the idea behind this RFC patch set was to brainstorm if there is a better approach to this. > > - Instead maybe factor out a decaying average library into > kernel/sched/avg.h perhaps (if this truly improves the code), and use > those methods both in pelt.c and any future thermal.c - and maybe > other places where we do decaying averages. > > - But simple decaying averages are not that complex either, so I think > your original solution of open coding it is probably fine as well. > > Furthermore, any logic introduced by thermal.c and the resulting change > to load-balancing behavior would have to be in perfect sync with cpufreq > governor actions - one mechanism should not work against the other. I agree. I will go one step further and argue that the changes here make best sense with sched util governor as it picks up the signals directly from the scheduler to choose an OPP. Any other governor will be less effective. > > The only long term maintainable solution is to move all high level > cpufreq logic and policy handling code into kernel/sched/cpufreq*.c, > which has been done to a fair degree already in the past ~2 years - but > it's unclear to me to what extent this is true for thermal throttling > policy currently: there might be more governor surgery and code > reshuffling required? > > The short term goal would be to at minimum have all the bugs lined up in > kernel/sched/* neatly, so that we have the chance to see and fix them in > a single place. ;-) I see that Daniel has already sent some patches for this! Regards Thara > > Thanks, > > Ingo > -- Regards Thara
* Thara Gopinath <thara.gopinath@linaro.org> wrote: > > Yeah, so I'd definitely suggest to not integrate this averaging into > > pelt.c in the fashion presented, because: > > > > - This couples your thermal throttling averaging to the PELT decay > > half-time AFAICS, which would break the other user every time the > > decay is changed/tuned. > > Let me pose the question in this manner. Today rt utilization, dl > utilization etc is tracked via PELT. The inherent idea is that, a cpu > has some capacity stolen by let us say rt task and so subtract the > capacity utilized by the rt task from cpu when calculating the > remaining capacity for a CFS task. Now, the idea behind thermal > pressure is that, the maximum available capacity of a cpu is limited > due to a thermal event, so take it out of the remaining capacity of a > cpu for a CFS task (at-least to start with). If utilization for rt > task, dl task etc is calculated via PELT and the capacity constraint > due to thermal event calculated by another averaging algorithm, there > can be some mismatch in the "capacity stolen" calculations, right? > Isnt't it better to track all the events that can limit the capacity of > a cpu via one algorithm ? So what unifies RT and DL utilization is that those are all direct task loads independent of external factors. Thermal load is more of a complex physical property of the combination of various internal and external factors: the whole system workload running (not just that single task), the thermal topology of the hardware, external temperatures, the hardware's and the governor's policy regarding thermal loads, etc. etc. So while obviously when effective capacity of a CPU is calculated then these will all be subtracted from the maximum capacity of the CPU, but I think the thermal load metric and the averaging itself is probably dissimilar enough to not be calculated via the PELT half-life for example. For example a reasonable future property would be match the speed of decay in the averaging to the observed speed of decay via temperature sensors? Most temperature sensors do a certain amount of averaging themselves as well - and some platforms might not expose temperatures at all, only 'got thermally throttled' / 'running at full speed' kind of feedback. Anyway, this doesn't really impact the concept, it's an implementational detail, and much of this could be resolved if the averaging code in pelt.c was librarized a bit - and that's really what you did there in a fashion, I just think it should probably be abstracted out more clearly. (I have no clear implementational suggestions right now, other than 'try and see how it works out - it might be a bad idea'.) Thanks, Ingo
Hi, On 19/10/2018 09:02, Ingo Molnar wrote: > > * Thara Gopinath <thara.gopinath@linaro.org> wrote: [...] > So what unifies RT and DL utilization is that those are all direct task > loads independent of external factors. > > Thermal load is more of a complex physical property of the combination of > various internal and external factors: the whole system workload running > (not just that single task), the thermal topology of the hardware, > external temperatures, the hardware's and the governor's policy regarding > thermal loads, etc. etc. > > So while obviously when effective capacity of a CPU is calculated then > these will all be subtracted from the maximum capacity of the CPU, but I > think the thermal load metric and the averaging itself is probably > dissimilar enough to not be calculated via the PELT half-life for > example. > > For example a reasonable future property would be match the speed of > decay in the averaging to the observed speed of decay via temperature > sensors? Most temperature sensors do a certain amount of averaging > themselves as well - and some platforms might not expose temperatures at > all, only 'got thermally throttled' / 'running at full speed' kind of > feedback. That would also open the door to having different decay speeds on different domains, if we have the tsensors for it - big and LITTLE cores are not going to heat up in the same way (although there's going to be some heat propagation). Another thermal decay speed hint I'd see would be the energy model - it does tell us after all how much energy is going through those cores, and with a rough estimate of how much they can take before overheating (sustainable-power entry in the devicetree) we might be able to deduce a somewhat sane decay speed. > > Anyway, this doesn't really impact the concept, it's an implementational > detail, and much of this could be resolved if the averaging code in > pelt.c was librarized a bit - and that's really what you did there in a > fashion, I just think it should probably be abstracted out more clearly. > (I have no clear implementational suggestions right now, other than 'try > and see how it works out - it might be a bad idea'.) > > Thanks, > > Ingo > > >
Hi Thara, On Tue, 9 Oct 2018 at 18:25, Thara Gopinath <thara.gopinath@linaro.org> wrote: > > Introduce support in CFS periodic tick to trigger the process of > computing average thermal pressure for a cpu. > > Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org> > --- > kernel/sched/fair.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index b39fb59..7deb1d0 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -21,6 +21,7 @@ > * Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra > */ > #include "sched.h" > +#include "thermal.h" > > #include <trace/events/sched.h> > > @@ -9557,6 +9558,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued) > > if (static_branch_unlikely(&sched_numa_balancing)) > task_tick_numa(rq, curr); > + > + update_periodic_maxcap(rq); You have to call update_periodic_maxcap() in update_blocked_averages() too Otherwise, the thermal pressure will not always be updated correctly for tickless system > } > > /* > -- > 2.1.4 >