Message ID | 1555443521-579-1-git-send-email-thara.gopinath@linaro.org
---|---
Series | Introduce Thermal Pressure
* Thara Gopinath <thara.gopinath@linaro.org> wrote:

> The test results below shows 3-5% improvement in performance when
> using the third solution compared to the default system today where
> scheduler is unware of cpu capacity limitations due to thermal events.

The numbers look very promising!

I've rearranged the results to make the performance properties of the
various approaches and parameters easier to see:

                                        (seconds, lower is better)

                                        Hackbench   Aobench   Dhrystone
                                        =========   =======   =========
Vanilla kernel (No Thermal Pressure)        10.21    141.58        1.14
Instantaneous thermal pressure              10.16    141.63        1.15
Thermal Pressure Averaging:
  - PELT fmwk                                9.88    134.48        1.19
  - non-PELT Algo. Decay : 500 ms            9.94    133.62        1.09
  - non-PELT Algo. Decay : 250 ms            7.52    137.22        1.012
  - non-PELT Algo. Decay : 125 ms            9.87    137.55        1.12

Firstly, a couple of questions about the numbers:

 1)

    Is the 1.012 result for "non-PELT 250 msecs Dhrystone" really 1.012?
    You reported it as:

        non-PELT Algo. Decay : 250 ms        1.012        7.02%

    But the formatting is significant 3 digits versus only two for all
    the other results.

 2)

    You reported the hackbench numbers with "10 runs" - did the other
    benchmarks use 10 runs as well? Maybe you used fewer runs for the
    longest benchmark, Aobench?

Secondly, it appears the non-PELT decaying average is the best approach,
but the results are a bit coarse around the ~250 msecs peak. Maybe it
would be good to measure it in 50 msecs steps between 50 msecs and
1000 msecs - but only if it can be scripted sanely:

A possible approach would be to add a debug sysctl for the tuning period,
and script all these benchmark runs and the printing of the results. You
could add another (debug) sysctl to turn the 'instant' logic on, and to
restore vanilla kernel behavior as well - this makes it all much easier
to script and measure with a single kernel image, without having to
reboot the kernel. The sysctl overhead will not be measurable for
workloads like this.

Then you can use "perf stat --null --table" to measure runtime and stddev
easily and with a single tool, for example:

  dagon:~> perf stat --null --sync --repeat 10 --table ./hackbench 20 >benchmark.out

   Performance counter stats for './hackbench 20' (10 runs):

           # Table of individual measurements:
           0.15246 (-0.03960) ######
           0.20832 (+0.01627) ##
           0.17895 (-0.01310) ##
           0.19791 (+0.00585) #
           0.19209 (+0.00004) #
           0.19406 (+0.00201) #
           0.22484 (+0.03278) ###
           0.18695 (-0.00511) #
           0.19032 (-0.00174) #
           0.19464 (+0.00259) #

           # Final result:
           0.19205 +- 0.00592 seconds time elapsed  ( +- 3.08% )

Note how all the individual measurements can be captured this way,
without seeing the benchmark output itself. So different benchmarks can
be measured this way, assuming they don't have too long a setup time.

Thanks,

	Ingo
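To make the debug-knob suggestion above concrete, here is a minimal sketch of what such sysctls could look like. Everything in it is illustrative - the knob names, defaults and ranges are invented, and the posted series does not add any such interface:

```c
#include <linux/errno.h>
#include <linux/init.h>
#include <linux/sysctl.h>

/*
 * Hypothetical debug knobs, purely for benchmarking from one kernel image:
 *   /proc/sys/kernel/sched_thermal_decay_ms - averaging decay period
 *                                             (0 = vanilla, no thermal pressure)
 *   /proc/sys/kernel/sched_thermal_instant  - use instantaneous thermal pressure
 */
static int sysctl_sched_thermal_decay_ms = 250;
static int sysctl_sched_thermal_instant;

static int zero;
static int one = 1;
static int decay_max_ms = 1000;

static struct ctl_table sched_thermal_debug_table[] = {
	{
		.procname	= "sched_thermal_decay_ms",
		.data		= &sysctl_sched_thermal_decay_ms,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= &zero,
		.extra2		= &decay_max_ms,
	},
	{
		.procname	= "sched_thermal_instant",
		.data		= &sysctl_sched_thermal_instant,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= &zero,
		.extra2		= &one,
	},
	{ }
};

static int __init sched_thermal_debug_init(void)
{
	/* Expose the knobs under /proc/sys/kernel/. */
	return register_sysctl("kernel", sched_thermal_debug_table) ? 0 : -ENOMEM;
}
late_initcall(sched_thermal_debug_init);
```

With knobs like these, a shell loop can write the decay period (or switch to instantaneous/vanilla behaviour) before each "perf stat --null --repeat" run, which is the single-image sweep described above.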
* Ingo Molnar <mingo@kernel.org> wrote: > * Thara Gopinath <thara.gopinath@linaro.org> wrote: > > > The test results below shows 3-5% improvement in performance when > > using the third solution compared to the default system today where > > scheduler is unware of cpu capacity limitations due to thermal events. > > The numbers look very promising! > > I've rearranged the results to make the performance properties of the > various approaches and parameters easier to see: > > (seconds, lower is better) > > Hackbench Aobench Dhrystone > ========= ======= ========= > Vanilla kernel (No Thermal Pressure) 10.21 141.58 1.14 > Instantaneous thermal pressure 10.16 141.63 1.15 > Thermal Pressure Averaging: > - PELT fmwk 9.88 134.48 1.19 > - non-PELT Algo. Decay : 500 ms 9.94 133.62 1.09 > - non-PELT Algo. Decay : 250 ms 7.52 137.22 1.012 > - non-PELT Algo. Decay : 125 ms 9.87 137.55 1.12 So what I forgot to say is that IMO your results show robust improvements over the vanilla kernel of around 5%, with a relatively straightforward thermal pressure metric. So I suspect we could do something like this, if there was a bit more measurements done to get the best decay period established - the 125-250-500 msecs results seem a bit coarse and not entirely unambiguous. In terms of stddev: the perf stat --pre hook could be used to add a dummy benchmark run, to heat up the test system, to get more reliable, less noisy numbers? BTW., that big improvement in hackbench results to 7.52 at 250 msecs, is that real, or a fluke perhaps? Thanks, Ingo
On 04/17/2019 01:36 AM, Ingo Molnar wrote: > > * Thara Gopinath <thara.gopinath@linaro.org> wrote: > >> The test results below shows 3-5% improvement in performance when >> using the third solution compared to the default system today where >> scheduler is unware of cpu capacity limitations due to thermal events. > > The numbers look very promising! Hello Ingo, Thank you for the review. > > I've rearranged the results to make the performance properties of the > various approaches and parameters easier to see: > > (seconds, lower is better) > > Hackbench Aobench Dhrystone > ========= ======= ========= > Vanilla kernel (No Thermal Pressure) 10.21 141.58 1.14 > Instantaneous thermal pressure 10.16 141.63 1.15 > Thermal Pressure Averaging: > - PELT fmwk 9.88 134.48 1.19 > - non-PELT Algo. Decay : 500 ms 9.94 133.62 1.09 > - non-PELT Algo. Decay : 250 ms 7.52 137.22 1.012 > - non-PELT Algo. Decay : 125 ms 9.87 137.55 1.12 > > > Firstly, a couple of questions about the numbers: > > 1) > > Is the 1.012 result for "non-PELT 250 msecs Dhrystone" really 1.012? > You reported it as: > > non-PELT Algo. Decay : 250 ms 1.012 7.02% It is indeed 1.012. So, I ran the "non-PELT Algo 250 ms" benchmarks multiple time because of the anomalies noticed. 1.012 is a formatting error on my part when I copy pasted the results into a google sheet I am maintaining to capture the test results. Sorry about the confusion. > > But the formatting is significant 3 digits versus only two for all > the other results. > > 2) > > You reported the hackbench numbers with "10 runs" - did the other > benchmarks use 10 runs as well? Maybe you used fewer runs for the > longest benchmark, Aobench? Hackbench and dhrystone are 10 runs each. Aobench is part of phoronix test suit and the test suite runs it six times and gives the per run results, mean and stddev. On my part, I ran aobench just once per configuration. > > Secondly, it appears the non-PELT decaying average is the best approach, > but the results are a bit coarse around the ~250 msecs peak. Maybe it > would be good to measure it in 50 msecs steps between 50 msecs and 1000 > msecs - but only if it can be scripted sanely: non-PELT looks better overall because the test results are quite comparable (if not better) between the two solutions and it takes care of concerns people raised when I posted V1 using PELT-fmwk algo regarding reuse of utilization signal to track thermal pressure. Regarding the decay period, I agree that more testing can be done. I like your suggestions below and I am going to try implementing them sometime next week. Once I have some solid results, I will send them out. My concern regarding getting hung up too much on decay period is that I think it could vary from SoC to SoC depending on the type and number of cores and thermal characteristics. So I was thinking eventually the decay period should be configurable via a config option or by any other means. Testing on different systems will definitely help and maybe I am wrong and there is no much variation between systems. Regards Thara > > A possible approach would be to add a debug sysctl for the tuning period, > and script all these benchmark runs and the printing of the results. You > could add another (debug) sysctl to turn the 'instant' logic on, and to > restore vanilla kernel behavior as well - this makes it all much easier > to script and measure with a single kernel image, without having to > reboot the kernel. The sysctl overhead will not be measurable for > workloads like this. 
> > Then you can use "perf stat --null --table" to measure runtime and stddev > easily and with a single tool, for example: > > dagon:~> perf stat --null --sync --repeat 10 --table ./hackbench 20 >benchmark.out > > Performance counter stats for './hackbench 20' (10 runs): > > # Table of individual measurements: > 0.15246 (-0.03960) ###### > 0.20832 (+0.01627) ## > 0.17895 (-0.01310) ## > 0.19791 (+0.00585) # > 0.19209 (+0.00004) # > 0.19406 (+0.00201) # > 0.22484 (+0.03278) ### > 0.18695 (-0.00511) # > 0.19032 (-0.00174) # > 0.19464 (+0.00259) # > > # Final result: > 0.19205 +- 0.00592 seconds time elapsed ( +- 3.08% ) > > Note how all the individual measurements can be captured this way, > without seeing the benchmark output itself. So different benchmarks can > be measured this way, assuming they don't have too long a setup time. > > Thanks, > > Ingo > -- Regards Thara
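Since the configurable-decay ("non-PELT") averaging is central to the discussion, here is a small userspace model of the general idea - sample the instantaneous thermal pressure and blend it into a running average whose memory is set by a tunable decay period. It is only a sketch: the blending formula, constants and names are illustrative and are not the code from the series.

```c
#include <stdio.h>

#define SCHED_CAPACITY_SCALE	1024

struct thermal_avg {
	double avg;		/* averaged thermal pressure, 0..1024 */
	double decay_ms;	/* tunable decay period */
};

/* Fold one sample in: the old average decays, the new sample is blended in. */
static void thermal_avg_update(struct thermal_avg *t, double delta_ms,
			       double instant_pressure)
{
	/* Fraction of the old average that survives delta_ms. */
	double keep = t->decay_ms / (t->decay_ms + delta_ms);

	t->avg = t->avg * keep + instant_pressure * (1.0 - keep);
}

int main(void)
{
	struct thermal_avg t = { .avg = 0.0, .decay_ms = 250.0 };
	int i;

	/* Capping removes 25% of capacity for 100 ms, then goes away. */
	for (i = 0; i < 10; i++)
		thermal_avg_update(&t, 10.0, SCHED_CAPACITY_SCALE / 4.0);
	for (i = 0; i < 10; i++)
		thermal_avg_update(&t, 10.0, 0.0);

	printf("averaged thermal pressure: %.1f\n", t.avg);
	return 0;
}
```

Halving the decay period makes the average track the instantaneous cap roughly twice as fast, which is the knob being swept in the 125/250/500 ms results above.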
On 04/17/2019 01:55 AM, Ingo Molnar wrote: > > * Ingo Molnar <mingo@kernel.org> wrote: > >> * Thara Gopinath <thara.gopinath@linaro.org> wrote: >> >>> The test results below shows 3-5% improvement in performance when >>> using the third solution compared to the default system today where >>> scheduler is unware of cpu capacity limitations due to thermal events. >> >> The numbers look very promising! >> >> I've rearranged the results to make the performance properties of the >> various approaches and parameters easier to see: >> >> (seconds, lower is better) >> >> Hackbench Aobench Dhrystone >> ========= ======= ========= >> Vanilla kernel (No Thermal Pressure) 10.21 141.58 1.14 >> Instantaneous thermal pressure 10.16 141.63 1.15 >> Thermal Pressure Averaging: >> - PELT fmwk 9.88 134.48 1.19 >> - non-PELT Algo. Decay : 500 ms 9.94 133.62 1.09 >> - non-PELT Algo. Decay : 250 ms 7.52 137.22 1.012 >> - non-PELT Algo. Decay : 125 ms 9.87 137.55 1.12 > > So what I forgot to say is that IMO your results show robust improvements > over the vanilla kernel of around 5%, with a relatively straightforward > thermal pressure metric. So I suspect we could do something like this, if > there was a bit more measurements done to get the best decay period > established - the 125-250-500 msecs results seem a bit coarse and not > entirely unambiguous. To give you the background, I started with decay period of 500 ms. No other reason except the previous version of rt-pressure that existed in the scheduler employed a 500 ms decay period. Then the idea was to decrease the decay period by half and see what happens and so on. But I agree, that it is a bit coarse. I will probably get around to implementing some of your suggestions to capture more granular results in the next few weeks. > > In terms of stddev: the perf stat --pre hook could be used to add a dummy > benchmark run, to heat up the test system, to get more reliable, less > noisy numbers? > > BTW., that big improvement in hackbench results to 7.52 at 250 msecs, is > that real, or a fluke perhaps? For me, it is an anomaly. Having said that, I did rerun the tests with this configuration at least twice(if not more) and the results were similar. It is an anomaly because I have no explanation as to why there is so much improvement at the 250 ms decay period. > > Thanks, > > Ingo > -- Regards Thara
* Thara Gopinath <thara.gopinath@linaro.org> wrote: > > On 04/17/2019 01:36 AM, Ingo Molnar wrote: > > > > * Thara Gopinath <thara.gopinath@linaro.org> wrote: > > > >> The test results below shows 3-5% improvement in performance when > >> using the third solution compared to the default system today where > >> scheduler is unware of cpu capacity limitations due to thermal events. > > > > The numbers look very promising! > > Hello Ingo, > Thank you for the review. > > > > I've rearranged the results to make the performance properties of the > > various approaches and parameters easier to see: > > > > (seconds, lower is better) > > > > Hackbench Aobench Dhrystone > > ========= ======= ========= > > Vanilla kernel (No Thermal Pressure) 10.21 141.58 1.14 > > Instantaneous thermal pressure 10.16 141.63 1.15 > > Thermal Pressure Averaging: > > - PELT fmwk 9.88 134.48 1.19 > > - non-PELT Algo. Decay : 500 ms 9.94 133.62 1.09 > > - non-PELT Algo. Decay : 250 ms 7.52 137.22 1.012 > > - non-PELT Algo. Decay : 125 ms 9.87 137.55 1.12 > > > > > > Firstly, a couple of questions about the numbers: > > > > 1) > > > > Is the 1.012 result for "non-PELT 250 msecs Dhrystone" really 1.012? > > You reported it as: > > > > non-PELT Algo. Decay : 250 ms 1.012 7.02% > > It is indeed 1.012. So, I ran the "non-PELT Algo 250 ms" benchmarks > multiple time because of the anomalies noticed. 1.012 is a formatting > error on my part when I copy pasted the results into a google sheet I am > maintaining to capture the test results. Sorry about the confusion. That's actually pretty good, because it suggests a 35% and 15% improvement over the vanilla kernel - which is very good for such CPU-bound workloads. Not that 5% is bad in itself - but 15% is better ;-) > Regarding the decay period, I agree that more testing can be done. I > like your suggestions below and I am going to try implementing them > sometime next week. Once I have some solid results, I will send them > out. Thanks! > My concern regarding getting hung up too much on decay period is that I > think it could vary from SoC to SoC depending on the type and number of > cores and thermal characteristics. So I was thinking eventually the > decay period should be configurable via a config option or by any other > means. Testing on different systems will definitely help and maybe I am > wrong and there is no much variation between systems. Absolutely, so I'd not be against keeping it a SCHED_DEBUG tunable or so, until there's a better understanding of how the physical properties of the SoC map to an ideal decay period. Assuming PeterZ & Rafael & Quentin doesn't hate the whole thermal load tracking approach. I suppose there's some connection of this to Energy Aware Scheduling? Or not ... Thanks, Ingo
On 04/17/2019 02:29 PM, Ingo Molnar wrote: > > * Thara Gopinath <thara.gopinath@linaro.org> wrote: > >> >> On 04/17/2019 01:36 AM, Ingo Molnar wrote: >>> >>> * Thara Gopinath <thara.gopinath@linaro.org> wrote: >>> >>>> The test results below shows 3-5% improvement in performance when >>>> using the third solution compared to the default system today where >>>> scheduler is unware of cpu capacity limitations due to thermal events. >>> >>> The numbers look very promising! >> >> Hello Ingo, >> Thank you for the review. >>> >>> I've rearranged the results to make the performance properties of the >>> various approaches and parameters easier to see: >>> >>> (seconds, lower is better) >>> >>> Hackbench Aobench Dhrystone >>> ========= ======= ========= >>> Vanilla kernel (No Thermal Pressure) 10.21 141.58 1.14 >>> Instantaneous thermal pressure 10.16 141.63 1.15 >>> Thermal Pressure Averaging: >>> - PELT fmwk 9.88 134.48 1.19 >>> - non-PELT Algo. Decay : 500 ms 9.94 133.62 1.09 >>> - non-PELT Algo. Decay : 250 ms 7.52 137.22 1.012 >>> - non-PELT Algo. Decay : 125 ms 9.87 137.55 1.12 >>> >>> >>> Firstly, a couple of questions about the numbers: >>> >>> 1) >>> >>> Is the 1.012 result for "non-PELT 250 msecs Dhrystone" really 1.012? >>> You reported it as: >>> >>> non-PELT Algo. Decay : 250 ms 1.012 7.02% >> >> It is indeed 1.012. So, I ran the "non-PELT Algo 250 ms" benchmarks >> multiple time because of the anomalies noticed. 1.012 is a formatting >> error on my part when I copy pasted the results into a google sheet I am >> maintaining to capture the test results. Sorry about the confusion. > > That's actually pretty good, because it suggests a 35% and 15% > improvement over the vanilla kernel - which is very good for such > CPU-bound workloads. > > Not that 5% is bad in itself - but 15% is better ;-) > >> Regarding the decay period, I agree that more testing can be done. I >> like your suggestions below and I am going to try implementing them >> sometime next week. Once I have some solid results, I will send them >> out. > > Thanks! > >> My concern regarding getting hung up too much on decay period is that I >> think it could vary from SoC to SoC depending on the type and number of >> cores and thermal characteristics. So I was thinking eventually the >> decay period should be configurable via a config option or by any other >> means. Testing on different systems will definitely help and maybe I am >> wrong and there is no much variation between systems. > > Absolutely, so I'd not be against keeping it a SCHED_DEBUG tunable or so, > until there's a better understanding of how the physical properties of > the SoC map to an ideal decay period. > > Assuming PeterZ & Rafael & Quentin doesn't hate the whole thermal load > tracking approach. I suppose there's some connection of this to Energy > Aware Scheduling? Or not ... Mmm.. Not so much. This does not have much to do with EAS. The feature itself will be really useful if there are asymmetric cpus in the system rather than symmetric cpus. In case of SMP, since all cores have same capacity and assuming any thermal mitigation will be implemented across the all the cpus, there won't be any different scheduler behavior. Regards Thara > > Thanks, > > Ingo > -- Regards Thara
On Wednesday 17 Apr 2019 at 20:29:32 (+0200), Ingo Molnar wrote: > > * Thara Gopinath <thara.gopinath@linaro.org> wrote: > > > > > On 04/17/2019 01:36 AM, Ingo Molnar wrote: > > > > > > * Thara Gopinath <thara.gopinath@linaro.org> wrote: > > > > > >> The test results below shows 3-5% improvement in performance when > > >> using the third solution compared to the default system today where > > >> scheduler is unware of cpu capacity limitations due to thermal events. > > > > > > The numbers look very promising! > > > > Hello Ingo, > > Thank you for the review. > > > > > > I've rearranged the results to make the performance properties of the > > > various approaches and parameters easier to see: > > > > > > (seconds, lower is better) > > > > > > Hackbench Aobench Dhrystone > > > ========= ======= ========= > > > Vanilla kernel (No Thermal Pressure) 10.21 141.58 1.14 > > > Instantaneous thermal pressure 10.16 141.63 1.15 > > > Thermal Pressure Averaging: > > > - PELT fmwk 9.88 134.48 1.19 > > > - non-PELT Algo. Decay : 500 ms 9.94 133.62 1.09 > > > - non-PELT Algo. Decay : 250 ms 7.52 137.22 1.012 > > > - non-PELT Algo. Decay : 125 ms 9.87 137.55 1.12 > > > > > > > > > Firstly, a couple of questions about the numbers: > > > > > > 1) > > > > > > Is the 1.012 result for "non-PELT 250 msecs Dhrystone" really 1.012? > > > You reported it as: > > > > > > non-PELT Algo. Decay : 250 ms 1.012 7.02% > > > > It is indeed 1.012. So, I ran the "non-PELT Algo 250 ms" benchmarks > > multiple time because of the anomalies noticed. 1.012 is a formatting > > error on my part when I copy pasted the results into a google sheet I am > > maintaining to capture the test results. Sorry about the confusion. > > That's actually pretty good, because it suggests a 35% and 15% > improvement over the vanilla kernel - which is very good for such > CPU-bound workloads. > > Not that 5% is bad in itself - but 15% is better ;-) > > > Regarding the decay period, I agree that more testing can be done. I > > like your suggestions below and I am going to try implementing them > > sometime next week. Once I have some solid results, I will send them > > out. > > Thanks! > > > My concern regarding getting hung up too much on decay period is that I > > think it could vary from SoC to SoC depending on the type and number of > > cores and thermal characteristics. So I was thinking eventually the > > decay period should be configurable via a config option or by any other > > means. Testing on different systems will definitely help and maybe I am > > wrong and there is no much variation between systems. > > Absolutely, so I'd not be against keeping it a SCHED_DEBUG tunable or so, > until there's a better understanding of how the physical properties of > the SoC map to an ideal decay period. +1, that'd be really useful to try this out on several platforms. > Assuming PeterZ & Rafael & Quentin doesn't hate the whole thermal load > tracking approach. I certainly don't hate it :-) In fact we already have something in the Android kernel to reflect thermal pressure into the CPU capacity using the 'instantaneous' approach. I'm all in favour of replacing our out-of-tree stuff by a mainline solution, and even more if that performs better. So yes, we need to discuss the implementation details and all, but I'd personally be really happy to see something upstream in this area. > I suppose there's some connection of this to Energy > Aware Scheduling? Or not ... Hmm, there isn't an immediate connection, I think. But that could change. 
FWIW I'm currently pushing a patch-set to make the thermal subsystem use the same Energy Model as EAS ([1]) instead of its own. There are several good reasons to do this, but one of them is to make sure the scheduler and the thermal stuff (and the rest of the kernel) have a consistent definition of what 'power' means. That might enable us to do smart things in the scheduler, but that's really for later. Thanks, Quentin [1] https://lore.kernel.org/lkml/20190417094301.17622-1-quentin.perret@arm.com/
Hi Thara, The idea and the results look promising. I'm trying to understand better the cause of the improvements so I've added below some questions that would help me out with this. > Regarding testing, basic build, boot and sanity testing have been > performed on hikey960 mainline kernel with debian file system. > Further, aobench (An occlusion renderer for benchmarking realworld > floating point performance), dhrystone and hackbench test have been > run with the thermal pressure algorithm. During testing, due to > constraints of step wise governor in dealing with big little systems, > cpu cooling was disabled on little core, the idea being that > big core will heat up and cpu cooling device will throttle the > frequency of the big cores there by limiting the maximum available > capacity and the scheduler will spread out tasks to little cores as well. > Finally, this patch series has been boot tested on db410C running v5.1-rc4 > kernel. > Did you try using IPA as well? It is better equipped to deal with big-LITTLE systems and it's more probable IPA will be used for these systems, where your solution will have the biggest impact as well. The difference will be that you'll have both the big cluster and the LITTLE cluster capped in different proportions depending on their utilization and their efficiency. > During the course of development various methods of capturing > and reflecting thermal pressure were implemented. > > The first method to be evaluated was to convert the > capped max frequency into capacity and have the scheduler use the > instantaneous value when updating cpu_capacity. > This method is referenced as "Instantaneous Thermal Pressure" in the > test results below. > > The next two methods employs different methods of averaging the > thermal pressure before applying it when updating cpu_capacity. > The first of these methods re-used the PELT algorithm already present > in the kernel that does the averaging of rt and dl load and utilization. > This method is referenced as "Thermal Pressure Averaging using PELT fmwk" > in the test results below. > > The final method employs an averaging algorithm that collects and > decays thermal pressure based on the decay period. In this method, > the decay period is configurable. This method is referenced as > "Thermal Pressure Averaging non-PELT Algo. Decay : XXX ms" in the > test results below. > > The test results below shows 3-5% improvement in performance when > using the third solution compared to the default system today where > scheduler is unware of cpu capacity limitations due to thermal events. > Did you happen to record the amount of capping imposed on the big cores when these results were obtained? Did you find scenarios where the capacity of the bigs resulted in being lower than the capacity of the LITTLEs (capacity inversion)? This is one case where we'll see a big impact in considering thermal pressure. Also, given that these are more or less sustained workloads, I'm wondering if there is any effect on workloads running on an uncapped system following capping. I would image such a test being composed of a single threaded period (no capping) followed by a multi-threaded period (with capping), continued in a loop. It might be interesting to have something like this as well, as part of your test coverage. Thanks, Ionela.
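For orientation, the "instantaneous" variant described in the quoted cover letter boils down to converting the thermally capped maximum frequency directly into a capacity value. A rough sketch of that conversion, with illustrative names rather than the actual patch code:

```c
/*
 * Capacity still available when thermal throttling caps the max frequency.
 * The difference (orig_capacity - returned value) is what the series calls
 * thermal pressure. Names are illustrative, not taken from the patches.
 */
static unsigned long capped_cpu_capacity(unsigned long orig_capacity,
					 unsigned long capped_freq,
					 unsigned long max_freq)
{
	return orig_capacity * capped_freq / max_freq;
}
```

The averaging variants differ only in low-pass filtering this value before the scheduler consumes it, instead of using it as-is.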
On Wed, Apr 17, 2019 at 08:29:32PM +0200, Ingo Molnar wrote: > Assuming PeterZ & Rafael & Quentin doesn't hate the whole thermal load > tracking approach. I seem to remember competing proposals, and have forgotten everything about them; the cover letter also didn't have references to them or mention them in any way. As to the averaging and period, I personally prefer a PELT signal with the windows lined up, if that really is too short a window, then a PELT like signal with a natural multiple of the PELT period would make sense, such that the windows still line up nicely. Mixing different averaging methods and non-aligned windows just makes me uncomfortable.
* Peter Zijlstra <peterz@infradead.org> wrote: > On Wed, Apr 17, 2019 at 08:29:32PM +0200, Ingo Molnar wrote: > > Assuming PeterZ & Rafael & Quentin doesn't hate the whole thermal load > > tracking approach. > > I seem to remember competing proposals, and have forgotten everything > about them; the cover letter also didn't have references to them or > mention them in any way. > > As to the averaging and period, I personally prefer a PELT signal with > the windows lined up, if that really is too short a window, then a PELT > like signal with a natural multiple of the PELT period would make sense, > such that the windows still line up nicely. > > Mixing different averaging methods and non-aligned windows just makes me > uncomfortable. Yeah, so the problem with PELT is that while it nicely approximates variable-period decay calculations with plain additions, shifts and table lookups (i.e. accelerates pow()), AFAICS the most important decay parameter is fixed: the speed of decay, the dampening factor, which is fixed at 32: Documentation/scheduler/sched-pelt.c #define HALFLIFE 32 Right? Thara's numbers suggest that there's high sensitivity to the speed of decay. By using PELT we'd be using whatever averaging speed there is within PELT. Now we could make that parametric of course, but that would both complicate the PELT lookup code (one more dimension) and would negatively affect code generation in a number of places. Thanks, Ingo
* Ingo Molnar <mingo@kernel.org> wrote: > > * Peter Zijlstra <peterz@infradead.org> wrote: > > > On Wed, Apr 17, 2019 at 08:29:32PM +0200, Ingo Molnar wrote: > > > Assuming PeterZ & Rafael & Quentin doesn't hate the whole thermal load > > > tracking approach. > > > > I seem to remember competing proposals, and have forgotten everything > > about them; the cover letter also didn't have references to them or > > mention them in any way. > > > > As to the averaging and period, I personally prefer a PELT signal with > > the windows lined up, if that really is too short a window, then a PELT > > like signal with a natural multiple of the PELT period would make sense, > > such that the windows still line up nicely. > > > > Mixing different averaging methods and non-aligned windows just makes me > > uncomfortable. > > Yeah, so the problem with PELT is that while it nicely approximates > variable-period decay calculations with plain additions, shifts and table > lookups (i.e. accelerates pow()), AFAICS the most important decay > parameter is fixed: the speed of decay, the dampening factor, which is > fixed at 32: > > Documentation/scheduler/sched-pelt.c > > #define HALFLIFE 32 > > Right? > > Thara's numbers suggest that there's high sensitivity to the speed of > decay. By using PELT we'd be using whatever averaging speed there is > within PELT. > > Now we could make that parametric of course, but that would both > complicate the PELT lookup code (one more dimension) and would negatively > affect code generation in a number of places. I missed the other solution, which is what you suggested: by increasing/reducing the PELT window size we can effectively shift decay speed and use just a single lookup table. I.e. instead of the fixed period size of 1024 in accumulate_sum(), use decay_load() directly but use a different (longer) window size from 1024 usecs to calculate 'periods', and make it a multiple of 1024. This might just work out right: with a half-life of 32 the fastest decay speed should be around ~20 msecs (?) - and Thara's numbers so far suggest that the sweet spot averaging is significantly longer, at a couple of hundred millisecs. Thanks, Ingo
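To make the window-scaling idea concrete: if the accounting window becomes a multiple k of 1024 us while the half-life stays at 32 windows, the effective half-life scales linearly with k. A trivial standalone calculation (not kernel code) of where the interesting decay periods land:

```c
#include <stdio.h>

#define PELT_WINDOW_US	1024ULL	/* base PELT period */
#define PELT_HALFLIFE	32	/* windows per half-life */

int main(void)
{
	unsigned int k;

	/* Effective half-life for window sizes of k * 1024 us. */
	for (k = 1; k <= 16; k *= 2)
		printf("window = %5llu us  ->  half-life ~ %6.1f ms\n",
		       k * PELT_WINDOW_US,
		       PELT_HALFLIFE * k * PELT_WINDOW_US / 1000.0);

	return 0;
}
```

k = 4, 8 or 16 lands right in the 125-500 ms range that was benchmarked above.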
On Thu, 25 Apr 2019 at 19:44, Ingo Molnar <mingo@kernel.org> wrote: > > > * Ingo Molnar <mingo@kernel.org> wrote: > > > > > * Peter Zijlstra <peterz@infradead.org> wrote: > > > > > On Wed, Apr 17, 2019 at 08:29:32PM +0200, Ingo Molnar wrote: > > > > Assuming PeterZ & Rafael & Quentin doesn't hate the whole thermal load > > > > tracking approach. > > > > > > I seem to remember competing proposals, and have forgotten everything > > > about them; the cover letter also didn't have references to them or > > > mention them in any way. > > > > > > As to the averaging and period, I personally prefer a PELT signal with > > > the windows lined up, if that really is too short a window, then a PELT > > > like signal with a natural multiple of the PELT period would make sense, > > > such that the windows still line up nicely. > > > > > > Mixing different averaging methods and non-aligned windows just makes me > > > uncomfortable. > > > > Yeah, so the problem with PELT is that while it nicely approximates > > variable-period decay calculations with plain additions, shifts and table > > lookups (i.e. accelerates pow()), AFAICS the most important decay > > parameter is fixed: the speed of decay, the dampening factor, which is > > fixed at 32: > > > > Documentation/scheduler/sched-pelt.c > > > > #define HALFLIFE 32 > > > > Right? > > > > Thara's numbers suggest that there's high sensitivity to the speed of > > decay. By using PELT we'd be using whatever averaging speed there is > > within PELT. > > > > Now we could make that parametric of course, but that would both > > complicate the PELT lookup code (one more dimension) and would negatively > > affect code generation in a number of places. > > I missed the other solution, which is what you suggested: by > increasing/reducing the PELT window size we can effectively shift decay > speed and use just a single lookup table. > > I.e. instead of the fixed period size of 1024 in accumulate_sum(), use > decay_load() directly but use a different (longer) window size from 1024 > usecs to calculate 'periods', and make it a multiple of 1024. Can't we also scale the now parameter of ___update_load_sum() ? If we right shift it before calling ___update_load_sum, it should be the same as using a half period of 62, 128, 256ms ... The main drawback would be a lost of precision but we are in the range of 2, 4, 8us compared to the 1ms window This is quite similar to how we scale the utilization with frequency and uarch > > This might just work out right: with a half-life of 32 the fastest decay > speed should be around ~20 msecs (?) - and Thara's numbers so far suggest > that the sweet spot averaging is significantly longer, at a couple of > hundred millisecs. > > Thanks, > > Ingo
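Vincent's alternative - scaling the time argument instead of the window - might look roughly like the helper below; this is illustrative only, not the kernel's ___update_load_sum() call site:

```c
/*
 * Right-shifting the timestamp by S before it reaches the PELT update makes
 * 2^S real microseconds look like one, so the effective half-life grows from
 * ~32 ms to ~64/128/256 ms for S = 1/2/3, at the cost of 2^S us of timestamp
 * resolution - the trade-off described above.
 */
static inline unsigned long long thermal_pelt_time(unsigned long long now_us,
						   unsigned int shift)
{
	return now_us >> shift;
}
```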
* Vincent Guittot <vincent.guittot@linaro.org> wrote: > On Thu, 25 Apr 2019 at 19:44, Ingo Molnar <mingo@kernel.org> wrote: > > > > > > * Ingo Molnar <mingo@kernel.org> wrote: > > > > > > > > * Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > > On Wed, Apr 17, 2019 at 08:29:32PM +0200, Ingo Molnar wrote: > > > > > Assuming PeterZ & Rafael & Quentin doesn't hate the whole thermal load > > > > > tracking approach. > > > > > > > > I seem to remember competing proposals, and have forgotten everything > > > > about them; the cover letter also didn't have references to them or > > > > mention them in any way. > > > > > > > > As to the averaging and period, I personally prefer a PELT signal with > > > > the windows lined up, if that really is too short a window, then a PELT > > > > like signal with a natural multiple of the PELT period would make sense, > > > > such that the windows still line up nicely. > > > > > > > > Mixing different averaging methods and non-aligned windows just makes me > > > > uncomfortable. > > > > > > Yeah, so the problem with PELT is that while it nicely approximates > > > variable-period decay calculations with plain additions, shifts and table > > > lookups (i.e. accelerates pow()), AFAICS the most important decay > > > parameter is fixed: the speed of decay, the dampening factor, which is > > > fixed at 32: > > > > > > Documentation/scheduler/sched-pelt.c > > > > > > #define HALFLIFE 32 > > > > > > Right? > > > > > > Thara's numbers suggest that there's high sensitivity to the speed of > > > decay. By using PELT we'd be using whatever averaging speed there is > > > within PELT. > > > > > > Now we could make that parametric of course, but that would both > > > complicate the PELT lookup code (one more dimension) and would negatively > > > affect code generation in a number of places. > > > > I missed the other solution, which is what you suggested: by > > increasing/reducing the PELT window size we can effectively shift decay > > speed and use just a single lookup table. > > > > I.e. instead of the fixed period size of 1024 in accumulate_sum(), use > > decay_load() directly but use a different (longer) window size from 1024 > > usecs to calculate 'periods', and make it a multiple of 1024. > > Can't we also scale the now parameter of ___update_load_sum() ? > If we right shift it before calling ___update_load_sum, it should be > the same as using a half period of 62, 128, 256ms ... > The main drawback would be a lost of precision but we are in the range > of 2, 4, 8us compared to the 1ms window > > This is quite similar to how we scale the utilization with frequency and uarch Yeah, that would work too. Thanks, Ingo
On 04/24/2019 11:57 AM, Ionela Voinescu wrote: > Hi Thara, > > The idea and the results look promising. I'm trying to understand better > the cause of the improvements so I've added below some questions that > would help me out with this. Hi Ionela, Thanks for the review. > > >> Regarding testing, basic build, boot and sanity testing have been >> performed on hikey960 mainline kernel with debian file system. >> Further, aobench (An occlusion renderer for benchmarking realworld >> floating point performance), dhrystone and hackbench test have been >> run with the thermal pressure algorithm. During testing, due to >> constraints of step wise governor in dealing with big little systems, >> cpu cooling was disabled on little core, the idea being that >> big core will heat up and cpu cooling device will throttle the >> frequency of the big cores there by limiting the maximum available >> capacity and the scheduler will spread out tasks to little cores as well. >> Finally, this patch series has been boot tested on db410C running v5.1-rc4 >> kernel. >> > > Did you try using IPA as well? It is better equipped to deal with > big-LITTLE systems and it's more probable IPA will be used for these > systems, where your solution will have the biggest impact as well. > The difference will be that you'll have both the big cluster and the > LITTLE cluster capped in different proportions depending on their > utilization and their efficiency. No. I did not use IPA simply because it was not enabled in mainline. I agree it is better equipped to deal with big-little systems. The idea to remove cpu cooling on little cluster was to in some (not the cleanest) manner to mimic this. But I agree that IPA testing is possibly the next step.Any help in this regard is appreciated. > >> During the course of development various methods of capturing >> and reflecting thermal pressure were implemented. >> >> The first method to be evaluated was to convert the >> capped max frequency into capacity and have the scheduler use the >> instantaneous value when updating cpu_capacity. >> This method is referenced as "Instantaneous Thermal Pressure" in the >> test results below. >> >> The next two methods employs different methods of averaging the >> thermal pressure before applying it when updating cpu_capacity. >> The first of these methods re-used the PELT algorithm already present >> in the kernel that does the averaging of rt and dl load and utilization. >> This method is referenced as "Thermal Pressure Averaging using PELT fmwk" >> in the test results below. >> >> The final method employs an averaging algorithm that collects and >> decays thermal pressure based on the decay period. In this method, >> the decay period is configurable. This method is referenced as >> "Thermal Pressure Averaging non-PELT Algo. Decay : XXX ms" in the >> test results below. >> >> The test results below shows 3-5% improvement in performance when >> using the third solution compared to the default system today where >> scheduler is unware of cpu capacity limitations due to thermal events. >> > > Did you happen to record the amount of capping imposed on the big cores > when these results were obtained? Did you find scenarios where the > capacity of the bigs resulted in being lower than the capacity of the > LITTLEs (capacity inversion)? > This is one case where we'll see a big impact in considering thermal > pressure. I think I saw capacity inversion in some scenarios. I did not particularly capture them. 
> > Also, given that these are more or less sustained workloads, I'm > wondering if there is any effect on workloads running on an uncapped > system following capping. I would image such a test being composed of a > single threaded period (no capping) followed by a multi-threaded period > (with capping), continued in a loop. It might be interesting to have > something like this as well, as part of your test coverage I do not understand this. There is either capping for a workload or no capping. There is no sysctl entry to turn on or off capping. Regards Thara > > > Thanks, > Ionela. > -- Regards Thara
Hi Thara, >>> Regarding testing, basic build, boot and sanity testing have been >>> performed on hikey960 mainline kernel with debian file system. >>> Further, aobench (An occlusion renderer for benchmarking realworld >>> floating point performance), dhrystone and hackbench test have been >>> run with the thermal pressure algorithm. During testing, due to >>> constraints of step wise governor in dealing with big little systems, >>> cpu cooling was disabled on little core, the idea being that >>> big core will heat up and cpu cooling device will throttle the >>> frequency of the big cores there by limiting the maximum available >>> capacity and the scheduler will spread out tasks to little cores as well. >>> Finally, this patch series has been boot tested on db410C running v5.1-rc4 >>> kernel. >>> >> >> Did you try using IPA as well? It is better equipped to deal with >> big-LITTLE systems and it's more probable IPA will be used for these >> systems, where your solution will have the biggest impact as well. >> The difference will be that you'll have both the big cluster and the >> LITTLE cluster capped in different proportions depending on their >> utilization and their efficiency. > > No. I did not use IPA simply because it was not enabled in mainline. I > agree it is better equipped to deal with big-little systems. The idea > to remove cpu cooling on little cluster was to in some (not the > cleanest) manner to mimic this. But I agree that IPA testing is possibly > the next step.Any help in this regard is appreciated. > I see CONFIG_THERMAL_GOV_POWER_ALLOCATOR=y in the defconfig for arm64: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/configs/defconfig?h=v5.1-rc6#n413 You can enable the use of it or make it default in the defconfig. Also, Hikey960 has the needed setup in DT: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/boot/dts/hisilicon/hi3660.dtsi?h=v5.1-rc6#n1093 This should work fine. >> >>> During the course of development various methods of capturing >>> and reflecting thermal pressure were implemented. >>> >>> The first method to be evaluated was to convert the >>> capped max frequency into capacity and have the scheduler use the >>> instantaneous value when updating cpu_capacity. >>> This method is referenced as "Instantaneous Thermal Pressure" in the >>> test results below. >>> >>> The next two methods employs different methods of averaging the >>> thermal pressure before applying it when updating cpu_capacity. >>> The first of these methods re-used the PELT algorithm already present >>> in the kernel that does the averaging of rt and dl load and utilization. >>> This method is referenced as "Thermal Pressure Averaging using PELT fmwk" >>> in the test results below. >>> >>> The final method employs an averaging algorithm that collects and >>> decays thermal pressure based on the decay period. In this method, >>> the decay period is configurable. This method is referenced as >>> "Thermal Pressure Averaging non-PELT Algo. Decay : XXX ms" in the >>> test results below. >>> >>> The test results below shows 3-5% improvement in performance when >>> using the third solution compared to the default system today where >>> scheduler is unware of cpu capacity limitations due to thermal events. >>> >> >> Did you happen to record the amount of capping imposed on the big cores >> when these results were obtained? 
Did you find scenarios where the >> capacity of the bigs resulted in being lower than the capacity of the >> LITTLEs (capacity inversion)? >> This is one case where we'll see a big impact in considering thermal >> pressure. > > I think I saw capacity inversion in some scenarios. I did not > particularly capture them. > It would be good to observe this and possibly correlate the amount of capping with resulting behavior and performance numbers. This would give more confidence in the testing coverage. You can create a specific testcase for capacity inversion by only capping the big CPUs, as you've done for these tests, and by running sysbench/dhrystone for example with at least nr_big_cpus tasks. This assumes that the bigs fully utilized would generate enough heat and would be capped enough to achieve a capacity lower than the littles, which on Hikey960 I don't doubt it can be obtained. >> >> Also, given that these are more or less sustained workloads, I'm >> wondering if there is any effect on workloads running on an uncapped >> system following capping. I would image such a test being composed of a >> single threaded period (no capping) followed by a multi-threaded period >> (with capping), continued in a loop. It might be interesting to have >> something like this as well, as part of your test coverage > > I do not understand this. There is either capping for a workload or no > capping. There is no sysctl entry to turn on or off capping. > I was thinking of this as a second hand effect. If you have only one big CPU even fully utilized, with the others quiet, you might not see any capping. But when you have a multi-threaded workload, with all or at least the bigs at a high OPP, the platform will definitely overheat and there will be capping. Thanks, Ionela. > Regards > Thara >> >> >> Thanks, >> Ionela. >> > >
Hi Thara, > > Hackbench: (1 group , 30000 loops, 10 runs) > Result Standard Deviation > (Time Secs) (% of mean) > > No Thermal Pressure 10.21 7.99% > > Instantaneous thermal pressure 10.16 5.36% > > Thermal Pressure Averaging > using PELT fmwk 9.88 3.94% > > Thermal Pressure Averaging > non-PELT Algo. Decay : 500 ms 9.94 4.59% > > Thermal Pressure Averaging > non-PELT Algo. Decay : 250 ms 7.52 5.42% > > Thermal Pressure Averaging > non-PELT Algo. Decay : 125 ms 9.87 3.94% > > I'm trying your patches on my Hikey960 and I'm getting different results than the ones here. I'm running with the step-wise governor, enabled only on the big cores. The decay period is set to 250ms. The result for hackbench is: # ./hackbench -g 1 -l 30000 Running in process mode with 1 groups using 40 file descriptors each (== 40 tasks) Each sender will pass 30000 messages of 100 bytes Time: 20.756 During the run I see the little cores running at maximum frequency (1.84GHz) while the big cores run mostly at 1.8GHz, only sometimes capped at 1.42GHz. There should not be any capacity inversion. The temperature is kept around 75 degrees (73 to 77 degrees). I don't have any kind of active cooling (no fans on the board), only a heatsink on the SoC. But as you see my results(~20s) are very far from the 7-10s in your results. Do you see anything wrong with this process? Can you give me more details on your setup that I can use to test on my board? Thank you, Ionela.
Hi Thara, On 29/04/2019 14:29, Ionela Voinescu wrote: > Hi Thara, > >> >> Hackbench: (1 group , 30000 loops, 10 runs) >> Result Standard Deviation >> (Time Secs) (% of mean) >> >> No Thermal Pressure 10.21 7.99% >> >> Instantaneous thermal pressure 10.16 5.36% >> >> Thermal Pressure Averaging >> using PELT fmwk 9.88 3.94% >> >> Thermal Pressure Averaging >> non-PELT Algo. Decay : 500 ms 9.94 4.59% >> >> Thermal Pressure Averaging >> non-PELT Algo. Decay : 250 ms 7.52 5.42% >> >> Thermal Pressure Averaging >> non-PELT Algo. Decay : 125 ms 9.87 3.94% >> >> > > I'm trying your patches on my Hikey960 and I'm getting different results > than the ones here. > > I'm running with the step-wise governor, enabled only on the big cores. > The decay period is set to 250ms. > > The result for hackbench is: > > # ./hackbench -g 1 -l 30000 > Running in process mode with 1 groups using 40 file descriptors each (== 40 tasks) > Each sender will pass 30000 messages of 100 bytes > Time: 20.756 > > During the run I see the little cores running at maximum frequency > (1.84GHz) while the big cores run mostly at 1.8GHz, only sometimes capped > at 1.42GHz. There should not be any capacity inversion. > The temperature is kept around 75 degrees (73 to 77 degrees). > > I don't have any kind of active cooling (no fans on the board), only a > heatsink on the SoC. > > But as you see my results(~20s) are very far from the 7-10s in your > results. > > Do you see anything wrong with this process? Can you give me more > details on your setup that I can use to test on my board? > I've found that my poor results above were due to debug options mistakenly left enabled in the defconfig. Sorry about that! After cleaning it up I'm getting results around 5.6s for this test case. I've run 50 iterations for each test, with 90s cool down period between them. Hackbench: (1 group , 30000 loops, 50 runs) Result Standard Deviation (Time Secs) (% of mean) No Thermal Pressure(step_wise) 5.644 7.760% No Thermal Pressure(IPA) 5.677 9.062% Thermal Pressure Averaging non-PELT Algo. Decay : 250 ms 5.627 5.593% (step-wise, bigs capped only) Thermal Pressure Averaging non-PELT Algo. Decay : 250 ms 5.690 3.738% (IPA) All of the results above are within 1.1% difference with a significantly higher standard deviation. I wanted to run this initially to validate my setup and understand if there is any conclusion we can draw from a test like this, that floods the CPUs with tasks. Looking over the traces, the tasks are running almost back to back, trying to use all available resources, on all the CPUs. Therefore, I doubt that there could be better decisions that could be made, knowing about thermal pressure, for this usecase. I'll try next some capacity inversion usecase and post the results when they are ready. Hope it helps, Ionela. > Thank you, > Ionela. >
On 04/29/2019 09:29 AM, Ionela Voinescu wrote: > Hi Thara, > >> >> Hackbench: (1 group , 30000 loops, 10 runs) >> Result Standard Deviation >> (Time Secs) (% of mean) >> >> No Thermal Pressure 10.21 7.99% >> >> Instantaneous thermal pressure 10.16 5.36% >> >> Thermal Pressure Averaging >> using PELT fmwk 9.88 3.94% >> >> Thermal Pressure Averaging >> non-PELT Algo. Decay : 500 ms 9.94 4.59% >> >> Thermal Pressure Averaging >> non-PELT Algo. Decay : 250 ms 7.52 5.42% >> >> Thermal Pressure Averaging >> non-PELT Algo. Decay : 125 ms 9.87 3.94% >> >> > > I'm trying your patches on my Hikey960 and I'm getting different results > than the ones here. > > I'm running with the step-wise governor, enabled only on the big cores. > The decay period is set to 250ms. > > The result for hackbench is: > > # ./hackbench -g 1 -l 30000 > Running in process mode with 1 groups using 40 file descriptors each (== 40 tasks) > Each sender will pass 30000 messages of 100 bytes > Time: 20.756 > > During the run I see the little cores running at maximum frequency > (1.84GHz) while the big cores run mostly at 1.8GHz, only sometimes capped > at 1.42GHz. There should not be any capacity inversion. > The temperature is kept around 75 degrees (73 to 77 degrees). > > I don't have any kind of active cooling (no fans on the board), only a > heatsink on the SoC. > > But as you see my results(~20s) are very far from the 7-10s in your > results. > > Do you see anything wrong with this process? Can you give me more > details on your setup that I can use to test on my board? Hi Ionela, I used the latest mainline kernel with sched/ tip merged in for my testing. My hikey960 did not have any fan or heat sink during testing. I disabled cpu cooling for little cores in the dts files. Also I have to warn you that I have managed to blow up my hikey960. So I no longer have a functional board for past two weeks or so. I don't have my test scripts to send you, but I have some of the results files downloaded which I can send you in a separate email. I did run the test 10 rounds. Also I think 20s is too much of variation for the test results. Like I mentioned in my previous emails I think the 7.52 is an anomaly but the results should be around the range of 8-9 s. Regards Thara > > Thank you, > Ionela. > -- Regards Thara
On 04/30/2019 11:57 AM, Thara Gopinath wrote: > On 04/29/2019 09:29 AM, Ionela Voinescu wrote: >> Hi Thara, >> >>> >>> Hackbench: (1 group , 30000 loops, 10 runs) >>> Result Standard Deviation >>> (Time Secs) (% of mean) >>> >>> No Thermal Pressure 10.21 7.99% >>> >>> Instantaneous thermal pressure 10.16 5.36% >>> >>> Thermal Pressure Averaging >>> using PELT fmwk 9.88 3.94% >>> >>> Thermal Pressure Averaging >>> non-PELT Algo. Decay : 500 ms 9.94 4.59% >>> >>> Thermal Pressure Averaging >>> non-PELT Algo. Decay : 250 ms 7.52 5.42% >>> >>> Thermal Pressure Averaging >>> non-PELT Algo. Decay : 125 ms 9.87 3.94% >>> >>> >> >> I'm trying your patches on my Hikey960 and I'm getting different results >> than the ones here. >> >> I'm running with the step-wise governor, enabled only on the big cores. >> The decay period is set to 250ms. >> >> The result for hackbench is: >> >> # ./hackbench -g 1 -l 30000 >> Running in process mode with 1 groups using 40 file descriptors each (== 40 tasks) >> Each sender will pass 30000 messages of 100 bytes >> Time: 20.756 >> >> During the run I see the little cores running at maximum frequency >> (1.84GHz) while the big cores run mostly at 1.8GHz, only sometimes capped >> at 1.42GHz. There should not be any capacity inversion. >> The temperature is kept around 75 degrees (73 to 77 degrees). >> >> I don't have any kind of active cooling (no fans on the board), only a >> heatsink on the SoC. >> >> But as you see my results(~20s) are very far from the 7-10s in your >> results. >> >> Do you see anything wrong with this process? Can you give me more >> details on your setup that I can use to test on my board? > > Hi Ionela, > > I used the latest mainline kernel with sched/ tip merged in for my > testing. My hikey960 did not have any fan or heat sink during testing. I > disabled cpu cooling for little cores in the dts files. > Also I have to warn you that I have managed to blow up my hikey960. So I > no longer have a functional board for past two weeks or so. > > I don't have my test scripts to send you, but I have some of the results > files downloaded which I can send you in a separate email. > I did run the test 10 rounds. Hi Ionela, I failed to mention that I drop the first run for averaging. > > Also I think 20s is too much of variation for the test results. Like I > mentioned in my previous emails I think the 7.52 is an anomaly but the > results should be around the range of 8-9 s. Also since we are more interested in comparison rather than absolute numbers did you run tests in a system with no thermal pressure( to see if there are any improvements)? Regards Thara > > Regards > Thara > >> >> Thank you, >> Ionela. >> > > -- Regards Thara
On 04/30/2019 10:39 AM, Ionela Voinescu wrote: > Hi Thara, > > On 29/04/2019 14:29, Ionela Voinescu wrote: >> Hi Thara, >> >>> >>> Hackbench: (1 group , 30000 loops, 10 runs) >>> Result Standard Deviation >>> (Time Secs) (% of mean) >>> >>> No Thermal Pressure 10.21 7.99% >>> >>> Instantaneous thermal pressure 10.16 5.36% >>> >>> Thermal Pressure Averaging >>> using PELT fmwk 9.88 3.94% >>> >>> Thermal Pressure Averaging >>> non-PELT Algo. Decay : 500 ms 9.94 4.59% >>> >>> Thermal Pressure Averaging >>> non-PELT Algo. Decay : 250 ms 7.52 5.42% >>> >>> Thermal Pressure Averaging >>> non-PELT Algo. Decay : 125 ms 9.87 3.94% >>> >>> >> >> I'm trying your patches on my Hikey960 and I'm getting different results >> than the ones here. >> >> I'm running with the step-wise governor, enabled only on the big cores. >> The decay period is set to 250ms. >> >> The result for hackbench is: >> >> # ./hackbench -g 1 -l 30000 >> Running in process mode with 1 groups using 40 file descriptors each (== 40 tasks) >> Each sender will pass 30000 messages of 100 bytes >> Time: 20.756 >> >> During the run I see the little cores running at maximum frequency >> (1.84GHz) while the big cores run mostly at 1.8GHz, only sometimes capped >> at 1.42GHz. There should not be any capacity inversion. >> The temperature is kept around 75 degrees (73 to 77 degrees). >> >> I don't have any kind of active cooling (no fans on the board), only a >> heatsink on the SoC. >> >> But as you see my results(~20s) are very far from the 7-10s in your >> results. >> >> Do you see anything wrong with this process? Can you give me more >> details on your setup that I can use to test on my board? >> > > I've found that my poor results above were due to debug options > mistakenly left enabled in the defconfig. Sorry about that! > > After cleaning it up I'm getting results around 5.6s for this test case. > I've run 50 iterations for each test, with 90s cool down period between > them. > > > Hackbench: (1 group , 30000 loops, 50 runs) > Result Standard Deviation > (Time Secs) (% of mean) > > No Thermal Pressure(step_wise) 5.644 7.760% > No Thermal Pressure(IPA) 5.677 9.062% > > Thermal Pressure Averaging > non-PELT Algo. Decay : 250 ms 5.627 5.593% > (step-wise, bigs capped only) > > Thermal Pressure Averaging > non-PELT Algo. Decay : 250 ms 5.690 3.738% > (IPA) > > All of the results above are within 1.1% difference with a > significantly higher standard deviation. Hi Ionela, I have replied to your original emails without seeing this one. So, interesting results. I see IPA is worse off (Slightly) than step wise in both thermal pressure and non-thermal pressure scenarios. Did you try 500 ms decay period by any chance? > > I wanted to run this initially to validate my setup and understand > if there is any conclusion we can draw from a test like this, that > floods the CPUs with tasks. Looking over the traces, the tasks are > running almost back to back, trying to use all available resources, > on all the CPUs. > Therefore, I doubt that there could be better decisions that could be > made, knowing about thermal pressure, for this usecase. > > I'll try next some capacity inversion usecase and post the results when > they are ready. Sure. let me know if I can help. Regards Thara > > Hope it helps, > Ionela. > > >> Thank you, >> Ionela. >> -- Regards Thara
Hi Thara,

>> After cleaning it up I'm getting results around 5.6s for this test case.
>> I've run 50 iterations for each test, with 90s cool down period between
>> them.
>>
>>
>> Hackbench: (1 group , 30000 loops, 50 runs)
>>                                  Result        Standard Deviation
>>                                  (Time Secs)   (% of mean)
>>
>> No Thermal Pressure(step_wise)   5.644         7.760%
>> No Thermal Pressure(IPA)         5.677         9.062%
>>
>> Thermal Pressure Averaging
>> non-PELT Algo. Decay : 250 ms    5.627         5.593%
>> (step-wise, bigs capped only)
>>
>> Thermal Pressure Averaging
>> non-PELT Algo. Decay : 250 ms    5.690         3.738%
>> (IPA)
>>
>> All of the results above are within 1.1% difference with a
>> significantly higher standard deviation.
>
> Hi Ionela,
>
> I have replied to your original emails without seeing this one. So,
> interesting results. I see IPA is worse off (Slightly) than step wise in
> both thermal pressure and non-thermal pressure scenarios. Did you try
> 500 ms decay period by any chance?
>

I don't think we can draw a conclusion on that given how close the
results are and given the high standard deviation. Probably if I run them
again the tables will be turned :).

I did not run experiments with different decay periods yet, as I want to
have first a list of experiments that are relevant for thermal pressure,
that can help later with refining the solution, which can mean either
deciding on a decay period or possibly going with the instantaneous
thermal pressure. Please find more details below.

>>
>> I wanted to run this initially to validate my setup and understand
>> if there is any conclusion we can draw from a test like this, that
>> floods the CPUs with tasks. Looking over the traces, the tasks are
>> running almost back to back, trying to use all available resources,
>> on all the CPUs.
>> Therefore, I doubt that there could be better decisions that could be
>> made, knowing about thermal pressure, for this usecase.
>>
>> I'll try next some capacity inversion usecase and post the results when
>> they are ready.
>

I've started looking into this, starting from the most obvious case of
capacity inversion: using the user-space thermal governor and capping the
bigs to their lowest OPP. The LITTLEs are left uncapped.

This was not enough on the Hikey960 as the bigs at their lowest OPP were
in the capacity margin of the LITTLEs at their highest OPP. That meant
that LITTLEs would not pull tasks from the bigs, even if they had higher
capacity, as the capacity was within the 25% margin. So another change
I've made was to set the capacity margin in fair.c to 10%.

I've run both sysbench and dhrystone. I'll put here only the results for
sysbench, interleaved, with and without considering thermal pressure
(TP and !TP). As before, the TP solution uses averaging with a 250ms
decay period.

Sysbench: (500000 req, 4 runs)
                 Result        Standard Deviation
                 (Time Secs)   (% of mean)

!TP/4 threads    146.46        0.063%
 TP/4 threads    136.36        0.002%

!TP/5 threads    115.38        0.028%
 TP/5 threads    110.62        0.006%

!TP/6 threads     95.38        0.051%
 TP/6 threads     93.07        0.054%

!TP/7 threads     81.19        0.012%
 TP/7 threads     80.32        0.028%

!TP/8 threads     72.58        2.295%
 TP/8 threads     71.37        0.044%

As expected, the results are significantly improved when the scheduler is
made aware of the reduced capacity on the bigs, which results in tasks
being placed or migrated to the littles, which are able to provide better
performance. Traces nicely confirm this.
It should be noted that these results only show that reflecting thermal
pressure in the capacity of the CPUs is useful and the scheduler is
equipped to make proper use of this information. Possibly a thing to
consider is whether or not to reduce the capacity margin, but that's for
another discussion.

This does not reflect the benefits of averaging, as, with the bigs always
being capped to the lowest OPP, the thermal pressure value will be
constant over the duration of the workload. The same results would have
been obtained with instantaneous thermal pressure.

Secondly, I've tried to use the step-wise governor, modified to only cap
the big CPUs, with the intention to obtain smaller periods of capacity
inversion for which a thermal pressure solution would show its benefits.
Unfortunately dhrystone was misbehaving for some reason and was giving me
a high variation between results for the same test case. Also, sysbench,
run with the same arguments as above, was not creating enough load and
thermal capping as to show the benefits of considering thermal pressure.

So my recommendation is to continue exploring more test cases like these.
I would continue with sysbench as it looks more stable, but modify the
temperature threshold to determine periods of drastic capping of the
bigs. Once a dynamic test case and setup like this (no fixed frequencies)
is identified, it can be used to understand if averaging is needed and to
refine the decay period, and establish a good default.

What do you think? Does this make sense as a direction for obtaining test
cases? In my opinion the previous test cases were not triggering the
right behaviors that can help prove the need for thermal pressure, or
help refine it.

I will try to continue in this direction, but I won't be able to get to
it for a few days.

You'll find more results at:
https://docs.google.com/spreadsheets/d/1ibxDSSSLTodLzihNAw6jM36eVZABuPMMnjvV-Xh4NEo/edit?usp=sharing

> Sure. let me know if I can help.

Any test results or recommendations for test cases would be helpful. The
need for thermal pressure is obvious, but the way that thermal pressure
is reflected in the capacity of the CPUs could be supported by more
thorough testing.

Regards,
Ionela.

> Regards
> Thara
>
>>
>> Hope it helps,
>> Ionela.
>>
>>
>>> Thank you,
>>> Ionela.
>>>