[RFT,v1] cpuidle: teo: Avoid selecting deepest idle state over-eagerly

Message ID 12630185.O9o76ZdvQC@rjwysocki.net
State New

Commit Message

Rafael J. Wysocki Feb. 4, 2025, 8:58 p.m. UTC
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

It has been observed that the recent teo governor update which concluded
with commit 16c8d7586c19 ("cpuidle: teo: Skip sleep length computation
for low latency constraints") caused the max-jOPS score of the SPECjbb
2015 benchmark [1] on Intel Granite Rapids to decrease by around 1.4%.
While it may be argued that this is not a significant decrease, the
previous score can be restored by tweaking the inequality used by teo
to decide whether or not to preselect the deepest enabled idle state.
That change also causes the critical-jOPS score of SPECjbb to increase
by around 2%.

Namely, the likelihood of selecting the deepest enabled idle state in
teo on the platform in question has increased after commit 13ed5c4a6d9c
("cpuidle: teo: Skip getting the sleep length if wakeups are very
frequent") because some timer wakeups were previously counted as non-
timer ones and they were effectively added to the left-hand side of the
inequality deciding whether or not to preselect the deepest idle state.

Many of them are now (accurately) counted as timer wakeups, so the left-
hand side of that inequality is now effectively smaller in some cases,
especially when timer wakeups often occur in the range below the target
residency of the deepest enabled idle state and idle states with target
residencies below CPUIDLE_FLAG_POLLING are often selected, but the
majority of recent idle intervals are still above that value most of
the time.  As a result, the deepest enabled idle state may be selected
more often than it used to be selected in some cases.

To counter that effect, add the sum of the hits metric for all of the
idle states below the candidate one (which is the deepest enabled idle
state at that point) to the left-hand side of the inequality mentioned
above.  This will cause it to be more balanced because, in principle,
putting both timer and non-timer wakeups on both sides of it is more
consistent than only taking into account the timer wakeups in the range
above the target residency of the deepest enabled idle state.

Link: https://www.spec.org/jbb2015/
Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/cpuidle/governors/teo.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
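
For illustration, below is a minimal user-space sketch of the selection
check before and after this change.  The variable names mirror the patch,
but the sample counts and the accumulation loop are made up and merely
stand in for the real per-CPU statistics kept by teo.c.

#include <stdio.h>

/*
 * Made-up stand-ins for the per-CPU teo metrics: hits[i] counts timer
 * wakeups matching idle state i, intercepts[i] counts non-timer (earlier)
 * wakeups attributed to it.
 */
static const unsigned int hits[]       = {  50, 100, 150, 300 };
static const unsigned int intercepts[] = {  20,  30,  50,  50 };
#define NR_STATES 4

int main(void)
{
        unsigned int total = 0, idx_hit_sum = 0, idx_intercept_sum = 0;
        int idx = NR_STATES - 1;        /* candidate: deepest enabled state */
        int i;

        for (i = 0; i < NR_STATES; i++)
                total += hits[i] + intercepts[i];

        /* Sums over the states shallower than the candidate one. */
        for (i = 0; i < idx; i++) {
                idx_hit_sum += hits[i];
                idx_intercept_sum += intercepts[i];
        }

        /* Old check: only shallow intercepts weigh against the candidate. */
        printf("old: look for a shallower state: %s\n",
               2 * idx_intercept_sum > total - idx_hit_sum ? "yes" : "no");

        /* New check: shallow hits are added to the left-hand side as well. */
        printf("new: look for a shallower state: %s\n",
               2 * (idx_intercept_sum + idx_hit_sum) > total ? "yes" : "no");

        return 0;
}

With these made-up counts the old form keeps the deepest state as the
candidate while the new form falls through to the shallower-state search,
which is the behavioral shift discussed in the rest of the thread.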

Comments

Rafael J. Wysocki Feb. 6, 2025, 2:37 p.m. UTC | #1
On Tue, Feb 4, 2025 at 9:58 PM Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> It has been observed that the recent teo governor update which concluded
> with commit 16c8d7586c19 ("cpuidle: teo: Skip sleep length computation
> for low latency constraints") caused the max-jOPS score of the SPECjbb
> 2015 benchmark [1] on Intel Granite Rapids to decrease by around 1.4%.
> While it may be argued that this is not a significant increase, the
> previous score can be restored by tweaking the inequality used by teo
> to decide whether or not to preselect the deepest enabled idle state.
> That change also causes the critical-jOPS score of SPECjbb to increase
> by around 2%.
>
> Namely, the likelihood of selecting the deepest enabled idle state in
> teo on the platform in question has increased after commit 13ed5c4a6d9c
> ("cpuidle: teo: Skip getting the sleep length if wakeups are very
> frequent") because some timer wakeups were previously counted as non-
> timer ones and they were effectively added to the left-hand side of the
> inequality deciding whether or not to preselect the deepest idle state.
>
> Many of them are now (accurately) counted as timer wakeups, so the left-
> hand side of that inequality is now effectively smaller in some cases,
> especially when timer wakeups often occur in the range below the target
> residency of the deepest enabled idle state and idle states with target
> residencies below CPUIDLE_FLAG_POLLING are often selected, but the
> majority of recent idle intervals are still above that value most of
> the time.  As a result, the deepest enabled idle state may be selected
> more often than it used to be selected in some cases.
>
> To counter that effect, add the sum of the hits metric for all of the
> idle states below the candidate one (which is the deepest enabled idle
> state at that point) to the left-hand side of the inequality mentioned
> above.  This will cause it to be more balanced because, in principle,
> putting both timer and non-timer wakeups on both sides of it is more
> consistent than only taking into account the timer wakeups in the range
> above the target residency of the deepest enabled idle state.
>
> Link: https://www.spec.org/jbb2015/
> Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  drivers/cpuidle/governors/teo.c |    6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> --- a/drivers/cpuidle/governors/teo.c
> +++ b/drivers/cpuidle/governors/teo.c
> @@ -349,13 +349,13 @@
>         }
>
>         /*
> -        * If the sum of the intercepts metric for all of the idle states
> -        * shallower than the current candidate one (idx) is greater than the
> +        * If the sum of the intercepts and hits metric for all of the idle
> +        * states below the current candidate one (idx) is greater than the
>          * sum of the intercepts and hits metrics for the candidate state and
>          * all of the deeper states, a shallower idle state is likely to be a
>          * better choice.
>          */
> -       if (2 * idx_intercept_sum > cpu_data->total - idx_hit_sum) {
> +       if (2 * (idx_intercept_sum + idx_hit_sum) > cpu_data->total) {
>                 int first_suitable_idx = idx;
>
>                 /*

For easier reference/testing this has been exposed in the git branch at

https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/log/?h=experimental/teo-tweak

on top of the cpuidle material that went into 6.14-rc1.
Doug Smythies Feb. 7, 2025, 11:40 p.m. UTC | #2
Hi Rafael,

On 2025.02.04 12:58 Rafael J. Wysocki wrote:
>
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> It has been observed that the recent teo governor update which concluded
> with commit 16c8d7586c19 ("cpuidle: teo: Skip sleep length computation
> for low latency constraints") caused the max-jOPS score of the SPECjbb
> 2015 benchmark [1] on Intel Granite Rapids to decrease by around 1.4%.
> While it may be argued that this is not a significant increase, the
> previous score can be restored by tweaking the inequality used by teo
> to decide whether or not to preselect the deepest enabled idle state.
> That change also causes the critical-jOPS score of SPECjbb to increase
> by around 2%.
>
> Namely, the likelihood of selecting the deepest enabled idle state in
> teo on the platform in question has increased after commit 13ed5c4a6d9c
> ("cpuidle: teo: Skip getting the sleep length if wakeups are very
> frequent") because some timer wakeups were previously counted as non-
> timer ones and they were effectively added to the left-hand side of the
> inequality deciding whether or not to preselect the deepest idle state.
>
> Many of them are now (accurately) counted as timer wakeups, so the left-
> hand side of that inequality is now effectively smaller in some cases,
> especially when timer wakeups often occur in the range below the target
> residency of the deepest enabled idle state and idle states with target
> residencies below CPUIDLE_FLAG_POLLING are often selected, but the
> majority of recent idle intervals are still above that value most of
> the time.  As a result, the deepest enabled idle state may be selected
> more often than it used to be selected in some cases.
>
> To counter that effect, add the sum of the hits metric for all of the
> idle states below the candidate one (which is the deepest enabled idle
> state at that point) to the left-hand side of the inequality mentioned
> above.  This will cause it to be more balanced because, in principle,
> putting both timer and non-timer wakeups on both sides of it is more
> consistent than only taking into account the timer wakeups in the range
> above the target residency of the deepest enabled idle state.
>
> Link: https://www.spec.org/jbb2015/
> Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  drivers/cpuidle/governors/teo.c |    6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> --- a/drivers/cpuidle/governors/teo.c
> +++ b/drivers/cpuidle/governors/teo.c
> @@ -349,13 +349,13 @@
>         }
>
>         /*
> -        * If the sum of the intercepts metric for all of the idle states
> -        * shallower than the current candidate one (idx) is greater than the
> +        * If the sum of the intercepts and hits metric for all of the idle
> +        * states below the current candidate one (idx) is greater than the
>          * sum of the intercepts and hits metrics for the candidate state and
>          * all of the deeper states, a shallower idle state is likely to be a
>          * better choice.
>          */
> -       if (2 * idx_intercept_sum > cpu_data->total - idx_hit_sum) {
> +       if (2 * (idx_intercept_sum + idx_hit_sum) > cpu_data->total) {
>                 int first_suitable_idx = idx;
>
>                 /*

I have only just started testing the recent idle governor changes,
and have not gotten very far yet.

There is a significant increase in processor package power during idle
with this patch, about 5 times increase (400%).

My processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
Distro: Ubuntu 24.04.1, server, no desktop GUI.
CPU frequency scaling driver: intel_pstate
HWP: disabled.
CPU frequency scaling governor: performance

Idle states:
$ grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
/sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
/sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1_ACPI
/sys/devices/system/cpu/cpu0/cpuidle/state2/name:C2_ACPI
/sys/devices/system/cpu/cpu0/cpuidle/state3/name:C3_ACPI

Test durations were >= 45 minutes each.

Kernel 6.14-rc1: Includes cpuidle: teo: Cleanups and very frequent wakeups handling update
Average Idle Power: teo governor: 2.199 watts (+25.51%)
Average Idle power: menu governor: 1.873 watts (+6.91%)

Kernel 6.14-rc1-p: Added this patch for teo and "cpuidle: menu: Avoid discarding useful information when processing recent idle intervals" for menu
Average Idle Power: teo governor: 9.401 watts (+436.6%)
Only 69.61% of idle time is in the deepest idle state. More typically it would be 98% to 99%.
28.6531% of idle time is in state 1. More typically it would be 0.03%.
Average Idle Power: menu governor: 1.820 watts (+3.9%)

Kernel 6.13: before "cpuidle: teo: Cleanups and very frequent wakeups handling update"
Average Idle Power: teo governor: 1.752 watts (reference: 0.0%)
Average Idle power: menu governor: 1.909 watts (+9.0%)

... Doug
Rafael J. Wysocki Feb. 8, 2025, 11:24 a.m. UTC | #3
Hi Doug,

On Sat, Feb 8, 2025 at 12:40 AM Doug Smythies <dsmythies@telus.net> wrote:
>
> Hi Rafael,
>
> On 2025.02.04 12:58 Rafael J. Wysocki wrote:
> >
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > It has been observed that the recent teo governor update which concluded
> > with commit 16c8d7586c19 ("cpuidle: teo: Skip sleep length computation
> > for low latency constraints") caused the max-jOPS score of the SPECjbb
> > 2015 benchmark [1] on Intel Granite Rapids to decrease by around 1.4%.
> > While it may be argued that this is not a significant increase, the
> > previous score can be restored by tweaking the inequality used by teo
> > to decide whether or not to preselect the deepest enabled idle state.
> > That change also causes the critical-jOPS score of SPECjbb to increase
> > by around 2%.
> >
> > Namely, the likelihood of selecting the deepest enabled idle state in
> > teo on the platform in question has increased after commit 13ed5c4a6d9c
> > ("cpuidle: teo: Skip getting the sleep length if wakeups are very
> > frequent") because some timer wakeups were previously counted as non-
> > timer ones and they were effectively added to the left-hand side of the
> > inequality deciding whether or not to preselect the deepest idle state.
> >
> > Many of them are now (accurately) counted as timer wakeups, so the left-
> > hand side of that inequality is now effectively smaller in some cases,
> > especially when timer wakeups often occur in the range below the target
> > residency of the deepest enabled idle state and idle states with target
> > residencies below CPUIDLE_FLAG_POLLING are often selected, but the
> > majority of recent idle intervals are still above that value most of
> > the time.  As a result, the deepest enabled idle state may be selected
> > more often than it used to be selected in some cases.
> >
> > To counter that effect, add the sum of the hits metric for all of the
> > idle states below the candidate one (which is the deepest enabled idle
> > state at that point) to the left-hand side of the inequality mentioned
> > above.  This will cause it to be more balanced because, in principle,
> > putting both timer and non-timer wakeups on both sides of it is more
> > consistent than only taking into account the timer wakeups in the range
> > above the target residency of the deepest enabled idle state.
> >
> > Link: https://www.spec.org/jbb2015/
> > Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> >  drivers/cpuidle/governors/teo.c |    6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> >
> > --- a/drivers/cpuidle/governors/teo.c
> > +++ b/drivers/cpuidle/governors/teo.c
> > @@ -349,13 +349,13 @@
> >         }
> >
> >         /*
> > -        * If the sum of the intercepts metric for all of the idle states
> > -        * shallower than the current candidate one (idx) is greater than the
> > +        * If the sum of the intercepts and hits metric for all of the idle
> > +        * states below the current candidate one (idx) is greater than the
> >          * sum of the intercepts and hits metrics for the candidate state and
> >          * all of the deeper states, a shallower idle state is likely to be a
> >          * better choice.
> >          */
> > -       if (2 * idx_intercept_sum > cpu_data->total - idx_hit_sum) {
> > +       if (2 * (idx_intercept_sum + idx_hit_sum) > cpu_data->total) {
> >                 int first_suitable_idx = idx;
> >
> >                 /*
>
> I have only just started testing the recent idle governor changes,
> and have not gotten very far yet.
>
> There is a significant increase in processor package power during idle
> with this patch, about 5 times increase (400%).

Thanks for this data point!

The restoration of the 1.4% benchmark score is not worth this cost IMV.

> My processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
> Distro: Ubuntu 24.04.1, server, no desktop GUI.
> CPU frequency scaling driver: intel_pstate
> HWP: disabled.
> CPU frequency scaling governor: performance
>
> Idle states:
> $ grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
> /sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
> /sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1_ACPI
> /sys/devices/system/cpu/cpu0/cpuidle/state2/name:C2_ACPI
> /sys/devices/system/cpu/cpu0/cpuidle/state3/name:C3_ACPI
>
> Test durations were >= 45 minutes each.
>
> Kernel 6.14-rc1: Includes cpuidle: teo: Cleanups and very frequent wakeups handling update
> Average Idle Power: teo governor: 2.199 watts (+25.51%)
> Average Idle power: menu governor: 1.873 watts (+6.91%)

menu hasn't changed in 6.14-rc1, so this would be variation between
runs I suppose.

> Kernel 6.14-rc1-p: Added this patch for teo and "cpuidle: menu: Avoid discarding useful information when processing recent idle intervals" for menu
> Average Idle Power: teo governor: 9.401 watts (+436.6%)
> Only 69.61% idle is in the deepest idle state. More typically it would be 98% to 99%.

Ah, not good.

OK, this clearly doesn't go in the right direction.

> 28.6531% idle time is in state 1. More typically it would be 0.03%
> Average Idle Power: menu governor: 1.820 watts (+3.9%)
>
> Kernel 6.13: before "cpuidle: teo: Cleanups and very frequent wakeups handling update"
> Average Idle Power: teo governor: 1.752 watts (reference: 0.0%)
> Average Idle power: menu governor: 1.909 watts (+9.0%)

Thanks, I'm just going to drop this patch then.

If you don't mind, I'll have a couple more teo updates for testing.
Artem Bityutskiy Feb. 9, 2025, 9:24 a.m. UTC | #4
On Fri, 2025-02-07 at 15:40 -0800, Doug Smythies wrote:
> I have only just started testing the recent idle governor changes,
> and have not gotten very far yet.

Hi Doug,

there is the menu governor patch too. While it helps the server, I did not test
it on a client system. Maybe you would like to check it too? Subject is:

[RFT][PATCH v1 0/5] cpuidle: menu: Avoid discarding useful information when processing recent idle intervals

Thanks!
Doug Smythies Feb. 10, 2025, 3:17 p.m. UTC | #5
Hi Artem,

On 2025.02.09 01:15 Artem Bityutskiy wrote:
> On Fri, 2025-02-07 at 15:40 -0800, Doug Smythies wrote:
>> I have only just started testing the recent idle governor changes,
>> and have not gotten very far yet.
>
> Hi Dough,
>
> there is the menu governor patch too. While it helps the server, I did not test
> it on a client system. May be you would to check it too? Subject is:
>
> [RFT][PATCH v1 0/5] cpuidle: menu: Avoid discarding useful information when processing recent idle intervals

Yes, I have included that patch set in my kernel 6.14-rc1 plus patches under test.
Other than the idle data, I don't have any other data to present yet as I have been
acquiring baseline data.
Doug Smythies Feb. 10, 2025, 3:17 p.m. UTC | #6
On 2025.02.08 03:25 Rafael J. Wysocki wrote:
>On Sat, Feb 8, 2025 at 12:40 AM Doug Smythies <dsmythies@telus.net> wrote:
>> On 2025.02.04 12:58 Rafael J. Wysocki wrote:
>>>
>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

... snip ...
>>
>> Test durations were >= 45 minutes each.
>>
>> Kernel 6.14-rc1: Includes cpuidle: teo: Cleanups and very frequent wakeups handling update
>> Average Idle Power: teo governor: 2.199 watts (+25.51%)
>> Average Idle power: menu governor: 1.873 watts (+6.91%)
>
> menu hasn't changed in 6.14-rc1, so this would be variation between
>runs I suppose.

Perhaps the way I presented my data wasn't the clearest.
All of the percentages were relative to kernel 6.13 with the teo idle governor.
Relative to the kernel 6.13 menu governor, that result would be -1.9%,
certainly within the noise floor.
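
To make the two baselines explicit, a quick check of the arithmetic (the
wattages are the ones reported earlier in this thread; the helper below is
only illustrative):

#include <stdio.h>

/* Percentage change of 'value' relative to 'baseline'. */
static double pct(double value, double baseline)
{
        return (value / baseline - 1.0) * 100.0;
}

int main(void)
{
        double teo_613 = 1.752;         /* kernel 6.13, teo (reference) */
        double menu_613 = 1.909;        /* kernel 6.13, menu */
        double menu_614rc1 = 1.873;     /* kernel 6.14-rc1, menu */

        /* +6.91% against the 6.13 teo reference used in the table... */
        printf("vs 6.13 teo:  %+.2f%%\n", pct(menu_614rc1, teo_613));
        /* ...but about -1.9% against 6.13 menu itself. */
        printf("vs 6.13 menu: %+.1f%%\n", pct(menu_614rc1, menu_613));

        return 0;
}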
 
>> Kernel 6.14-rc1-p: Added this patch for teo and "cpuidle: menu: Avoid discarding useful information when processing recent idle intervals" for menu
>> Average Idle Power: teo governor: 9.401 watts (+436.6%)
>> Only 69.61% idle is in the deepest idle state. More typically it would be 98% to 99%.
>
> Ah, not good.
>
> OK, this clearly doesn't go in the right direction.
>
>> 28.6531% idle time is in state 1. More typically it would be 0.03%
>> Average Idle Power: menu governor: 1.820 watts (+3.9%)
>>
>> Kernel 6.13: before "cpuidle: teo: Cleanups and very frequent wakeups handling update"
>> Average Idle Power: teo governor: 1.752 watts (reference: 0.0%)
>> Average Idle power: menu governor: 1.909 watts (+9.0%)
>
> Thanks, I'm just going to drop this patch then.
>
> If you don't mind, I'll have a couple more teo updates for testing.

O.K. I'll watch for the patches.
There is plenty to do in the meantime: acquiring baseline data and continuing with the
"cpuidle: menu: Avoid discarding useful information when processing recent idle intervals"
patch set portion of kernel 6.14-rc1-p.
Christian Loehle Feb. 13, 2025, 2:07 p.m. UTC | #7
On 2/4/25 20:58, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> It has been observed that the recent teo governor update which concluded
> with commit 16c8d7586c19 ("cpuidle: teo: Skip sleep length computation
> for low latency constraints") caused the max-jOPS score of the SPECjbb
> 2015 benchmark [1] on Intel Granite Rapids to decrease by around 1.4%.
> While it may be argued that this is not a significant increase, the
> previous score can be restored by tweaking the inequality used by teo
> to decide whether or not to preselect the deepest enabled idle state.
> That change also causes the critical-jOPS score of SPECjbb to increase
> by around 2%.
> 
> Namely, the likelihood of selecting the deepest enabled idle state in
> teo on the platform in question has increased after commit 13ed5c4a6d9c
> ("cpuidle: teo: Skip getting the sleep length if wakeups are very
> frequent") because some timer wakeups were previously counted as non-
> timer ones and they were effectively added to the left-hand side of the
> inequality deciding whether or not to preselect the deepest idle state.
> 
> Many of them are now (accurately) counted as timer wakeups, so the left-
> hand side of that inequality is now effectively smaller in some cases,
> especially when timer wakeups often occur in the range below the target
> residency of the deepest enabled idle state and idle states with target
> residencies below CPUIDLE_FLAG_POLLING are often selected, but the
> majority of recent idle intervals are still above that value most of
> the time.  As a result, the deepest enabled idle state may be selected
> more often than it used to be selected in some cases.
> 
> To counter that effect, add the sum of the hits metric for all of the
> idle states below the candidate one (which is the deepest enabled idle
> state at that point) to the left-hand side of the inequality mentioned
> above.  This will cause it to be more balanced because, in principle,
> putting both timer and non-timer wakeups on both sides of it is more
> consistent than only taking into account the timer wakeups in the range
> above the target residency of the deepest enabled idle state.
> 
> Link: https://www.spec.org/jbb2015/
> Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  drivers/cpuidle/governors/teo.c |    6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> --- a/drivers/cpuidle/governors/teo.c
> +++ b/drivers/cpuidle/governors/teo.c
> @@ -349,13 +349,13 @@
>  	}
>  
>  	/*
> -	 * If the sum of the intercepts metric for all of the idle states
> -	 * shallower than the current candidate one (idx) is greater than the
> +	 * If the sum of the intercepts and hits metric for all of the idle
> +	 * states below the current candidate one (idx) is greater than the
>  	 * sum of the intercepts and hits metrics for the candidate state and
>  	 * all of the deeper states, a shallower idle state is likely to be a
>  	 * better choice.
>  	 */
> -	if (2 * idx_intercept_sum > cpu_data->total - idx_hit_sum) {
> +	if (2 * (idx_intercept_sum + idx_hit_sum) > cpu_data->total) {
>  		int first_suitable_idx = idx;
>  
>  		/*
> 
> 
> 

I'm curious, are Doug's numbers reproducible?
Or could you share the idle state usage numbers? Is that explainable?
Seems like a lot and it does worry me that I can't reproduce anything
as drastic.

I did a bit of x86 testing as well; the Raptor Lake results are below (I
won't post the non-x86 numbers now, but teo-tweak performs comparably to
teo mainline).

Idle 5 min:
device	 gov	 iter	 Joules	 idles	 idle_misses	 idle_miss_ratio	 belows	 aboves	
teo 	0 	170.02 	12690 	646 	0.051 	371 	275
teo 	1 	123.17 	8361 	517 	0.062 	281 	236
teo 	2 	122.76 	7741 	347 	0.045 	262 	85
teo 	3 	118.5 	8699 	668 	0.077 	307 	361
teo 	4 	80.46 	8113 	443 	0.055 	264 	179
teo-tweak 	0 	115.05 	10223 	803 	0.079 	323 	480
teo-tweak 	1 	164.41 	8523 	631 	0.074 	263 	368
teo-tweak 	2 	163.91 	8409 	711 	0.085 	256 	455
teo-tweak 	3 	137.22 	8581 	721 	0.084 	261 	460
teo-tweak 	4 	174.95 	8703 	675 	0.078 	261 	414
teo 	0 	164.34 	8443 	516 	0.061 	303 	213
teo 	1 	167.85 	8767 	492 	0.056 	256 	236
teo 	2 	166.25 	7835 	406 	0.052 	263 	143
teo 	3 	189.77 	8865 	493 	0.056 	276 	217
teo 	4 	136.97 	9185 	467 	0.051 	286 	181

At least in the idle case you can see an increase in 'above' idle_misses.
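
As a side note on reading these tables, the columns appear to satisfy
idle_misses = belows + aboves and idle_miss_ratio = idle_misses / idles;
that relation is inferred from the numbers themselves, not taken from the
tooling.  A quick check against the first teo row:

#include <stdio.h>

int main(void)
{
        /* First "teo" row of the idle table above. */
        unsigned int idles = 12690, belows = 371, aboves = 275;
        unsigned int misses = belows + aboves;          /* 646 */

        printf("idle_misses = %u\n", misses);
        printf("idle_miss_ratio = %.3f\n", (double)misses / idles); /* ~0.051 */

        return 0;
}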

Firefox Youtube 4K video playback 2 min:
device	 gov	 iter	 Joules	 idles	 idle_misses	 idle_miss_ratio	 belows	 aboves	
teo 	0 	260.09 	67404 	11189 	0.166 	1899 	9290
teo 	1 	273.71 	76649 	12155 	0.159 	2233 	9922
teo 	2 	231.45 	59559 	10344 	0.174 	1747 	8597
teo 	3 	202.61 	58223 	10641 	0.183 	1748 	8893
teo 	4 	217.56 	61411 	10744 	0.175 	1809 	8935
teo-tweak 	0 	227.99 	61209 	11251 	0.184 	2110 	9141
teo-tweak 	1 	222.44 	61959 	10323 	0.167 	1474 	8849
teo-tweak 	2 	218.1 	64380 	11080 	0.172 	1845 	9235
teo-tweak 	3 	207.4 	60183 	11267 	0.187 	1929 	9338
teo-tweak 	4 	217.46 	61253 	10381 	0.169 	1620 	8761
menu 	0 	225.72 	87871 	26032 	0.296 	25412 	620
menu 	1 	200.36 	86577 	24712 	0.285 	24486 	226
menu 	2 	214.79 	84885 	24750 	0.292 	24556 	194
menu 	3 	206.07 	88007 	25938 	0.295 	25683 	255
menu 	4 	216.48 	88700 	26504 	0.299 	26302 	202

(The idle numbers aren't really reflected in the energy used here -> it is
dominated by active power.)
Doug Smythies Feb. 14, 2025, 4:23 a.m. UTC | #8
Hi Christian,

Thank you for trying to repeat my idle test results.

On 2025.02.13 06:07 Christian Loehle wrote:

> I'm curious, are Doug's numbers reproducible?
> Or could you share the idle state usage numbers? Is that explainable?
> Seems like a lot and it does worry me that I can't reproduce anything
> as drastic.

While I am having some severe repeatability issues with my testing, they
do not affect this test.

Please recall my test conditions because the CPU frequency
scaling governor does matter. I was using "performance".
The power comes from the high amount of time in idle state 1.
I verified the idle state 1 power use by disabling all other idle states.
I also have HWP disabled, but do not know if it matters.
If I use the "powersave" governor (driver is intel_pstate, not
intel_cpufreq) then idle power is < 2 watts.
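
For anyone wanting to reproduce that kind of check, individual idle states
can be disabled through the per-state "disable" attribute in sysfs.  A
minimal sketch, assuming the state indices from the listing above and that
the same is repeated for every CPU (needs root):

#include <stdio.h>

/*
 * Write 1 to /sys/devices/system/cpu/cpu<cpu>/cpuidle/state<state>/disable
 * to disable that idle state on that CPU; writing 0 re-enables it.
 */
static int set_idle_state_disabled(int cpu, int state, int disabled)
{
        char path[128];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cpuidle/state%d/disable",
                 cpu, state);
        f = fopen(path, "w");
        if (!f)
                return -1;
        fprintf(f, "%d\n", disabled);
        return fclose(f);
}

int main(void)
{
        /* Keep state1 (C1_ACPI) enabled on cpu0, disable POLL, C2 and C3. */
        set_idle_state_disabled(0, 0, 1);
        set_idle_state_disabled(0, 2, 1);
        set_idle_state_disabled(0, 3, 1);

        return 0;
}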

Anyway, my data:

http://smythies.com/~doug/linux/idle/teo-6.14/idle/perf/

> (Idle numbers aren't really reflective in energy used -> dominated by
> active power.)

Well, it depends on idle time versus active time on the computer.
I also measured the difference in the mains power at 20%, from
43.2 watts to 51.4 watts.

I am about to send a long email with all of my test results.
Since I cannot seem to function without making graphs,
it has a lot of links to graphs.
Rafael J. Wysocki Feb. 14, 2025, 9:34 p.m. UTC | #9
On Thu, Feb 13, 2025 at 3:08 PM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 2/4/25 20:58, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > It has been observed that the recent teo governor update which concluded
> > with commit 16c8d7586c19 ("cpuidle: teo: Skip sleep length computation
> > for low latency constraints") caused the max-jOPS score of the SPECjbb
> > 2015 benchmark [1] on Intel Granite Rapids to decrease by around 1.4%.
> > While it may be argued that this is not a significant increase, the
> > previous score can be restored by tweaking the inequality used by teo
> > to decide whether or not to preselect the deepest enabled idle state.
> > That change also causes the critical-jOPS score of SPECjbb to increase
> > by around 2%.
> >
> > Namely, the likelihood of selecting the deepest enabled idle state in
> > teo on the platform in question has increased after commit 13ed5c4a6d9c
> > ("cpuidle: teo: Skip getting the sleep length if wakeups are very
> > frequent") because some timer wakeups were previously counted as non-
> > timer ones and they were effectively added to the left-hand side of the
> > inequality deciding whether or not to preselect the deepest idle state.
> >
> > Many of them are now (accurately) counted as timer wakeups, so the left-
> > hand side of that inequality is now effectively smaller in some cases,
> > especially when timer wakeups often occur in the range below the target
> > residency of the deepest enabled idle state and idle states with target
> > residencies below CPUIDLE_FLAG_POLLING are often selected, but the
> > majority of recent idle intervals are still above that value most of
> > the time.  As a result, the deepest enabled idle state may be selected
> > more often than it used to be selected in some cases.
> >
> > To counter that effect, add the sum of the hits metric for all of the
> > idle states below the candidate one (which is the deepest enabled idle
> > state at that point) to the left-hand side of the inequality mentioned
> > above.  This will cause it to be more balanced because, in principle,
> > putting both timer and non-timer wakeups on both sides of it is more
> > consistent than only taking into account the timer wakeups in the range
> > above the target residency of the deepest enabled idle state.
> >
> > Link: https://www.spec.org/jbb2015/
> > Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> >  drivers/cpuidle/governors/teo.c |    6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> >
> > --- a/drivers/cpuidle/governors/teo.c
> > +++ b/drivers/cpuidle/governors/teo.c
> > @@ -349,13 +349,13 @@
> >       }
> >
> >       /*
> > -      * If the sum of the intercepts metric for all of the idle states
> > -      * shallower than the current candidate one (idx) is greater than the
> > +      * If the sum of the intercepts and hits metric for all of the idle
> > +      * states below the current candidate one (idx) is greater than the
> >        * sum of the intercepts and hits metrics for the candidate state and
> >        * all of the deeper states, a shallower idle state is likely to be a
> >        * better choice.
> >        */
> > -     if (2 * idx_intercept_sum > cpu_data->total - idx_hit_sum) {
> > +     if (2 * (idx_intercept_sum + idx_hit_sum) > cpu_data->total) {
> >               int first_suitable_idx = idx;
> >
> >               /*
> >
> >
> >
>
> I'm curious, are Doug's numbers reproducible?
> Or could you share the idle state usage numbers? Is that explainable?
> Seems like a lot and it does worry me that I can't reproduce anything
> as drastic.

Well, it may not be drastic, but the results below pretty much confirm
that this particular change isn't going in the right direction IMV.

> I did a bit of x86 as well and got for Raptor Lake (I won't post the
> non-x86 numbers now, but teo-tweak performs comparable to teo mainline.)
>
> Idle 5 min:
> device   gov     iter    Joules  idles   idle_misses     idle_miss_ratio         belows  aboves
> teo     0       170.02  12690   646     0.051   371     275
> teo     1       123.17  8361    517     0.062   281     236
> teo     2       122.76  7741    347     0.045   262     85
> teo     3       118.5   8699    668     0.077   307     361
> teo     4       80.46   8113    443     0.055   264     179
> teo-tweak       0       115.05  10223   803     0.079   323     480
> teo-tweak       1       164.41  8523    631     0.074   263     368
> teo-tweak       2       163.91  8409    711     0.085   256     455
> teo-tweak       3       137.22  8581    721     0.084   261     460
> teo-tweak       4       174.95  8703    675     0.078   261     414

So basically the energy usage goes up, idle misses go up, idle misses
ratio goes up and the "above" misses go way up.  Not so good as far as
I'm concerned.

> teo     0       164.34  8443    516     0.061   303     213
> teo     1       167.85  8767    492     0.056   256     236
> teo     2       166.25  7835    406     0.052   263     143
> teo     3       189.77  8865    493     0.056   276     217
> teo     4       136.97  9185    467     0.051   286     181

The above is menu I think?

> At least in the idle case you can see an increase in 'above' idle_misses.
>
> Firefox Youtube 4K video playback 2 min:
> device   gov     iter    Joules  idles   idle_misses     idle_miss_ratio         belows  aboves
> teo     0       260.09  67404   11189   0.166   1899    9290
> teo     1       273.71  76649   12155   0.159   2233    9922
> teo     2       231.45  59559   10344   0.174   1747    8597
> teo     3       202.61  58223   10641   0.183   1748    8893
> teo     4       217.56  61411   10744   0.175   1809    8935
> teo-tweak       0       227.99  61209   11251   0.184   2110    9141
> teo-tweak       1       222.44  61959   10323   0.167   1474    8849
> teo-tweak       2       218.1   64380   11080   0.172   1845    9235
> teo-tweak       3       207.4   60183   11267   0.187   1929    9338
> teo-tweak       4       217.46  61253   10381   0.169   1620    8761

And it doesn't improve things drastically here, although on average it
does reduce energy usage.

> menu    0       225.72  87871   26032   0.296   25412   620
> menu    1       200.36  86577   24712   0.285   24486   226
> menu    2       214.79  84885   24750   0.292   24556   194
> menu    3       206.07  88007   25938   0.295   25683   255
> menu    4       216.48  88700   26504   0.299   26302   202
>
> (Idle numbers aren't really reflective in energy used -> dominated by
> active power.)
>
Christian Loehle Feb. 18, 2025, 11:28 a.m. UTC | #10
On 2/14/25 21:34, Rafael J. Wysocki wrote:
> On Thu, Feb 13, 2025 at 3:08 PM Christian Loehle
> <christian.loehle@arm.com> wrote:
>>
>> On 2/4/25 20:58, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>
>>> It has been observed that the recent teo governor update which concluded
>>> with commit 16c8d7586c19 ("cpuidle: teo: Skip sleep length computation
>>> for low latency constraints") caused the max-jOPS score of the SPECjbb
>>> 2015 benchmark [1] on Intel Granite Rapids to decrease by around 1.4%.
>>> While it may be argued that this is not a significant increase, the
>>> previous score can be restored by tweaking the inequality used by teo
>>> to decide whether or not to preselect the deepest enabled idle state.
>>> That change also causes the critical-jOPS score of SPECjbb to increase
>>> by around 2%.
>>>
>>> Namely, the likelihood of selecting the deepest enabled idle state in
>>> teo on the platform in question has increased after commit 13ed5c4a6d9c
>>> ("cpuidle: teo: Skip getting the sleep length if wakeups are very
>>> frequent") because some timer wakeups were previously counted as non-
>>> timer ones and they were effectively added to the left-hand side of the
>>> inequality deciding whether or not to preselect the deepest idle state.
>>>
>>> Many of them are now (accurately) counted as timer wakeups, so the left-
>>> hand side of that inequality is now effectively smaller in some cases,
>>> especially when timer wakeups often occur in the range below the target
>>> residency of the deepest enabled idle state and idle states with target
>>> residencies below CPUIDLE_FLAG_POLLING are often selected, but the
>>> majority of recent idle intervals are still above that value most of
>>> the time.  As a result, the deepest enabled idle state may be selected
>>> more often than it used to be selected in some cases.
>>>
>>> To counter that effect, add the sum of the hits metric for all of the
>>> idle states below the candidate one (which is the deepest enabled idle
>>> state at that point) to the left-hand side of the inequality mentioned
>>> above.  This will cause it to be more balanced because, in principle,
>>> putting both timer and non-timer wakeups on both sides of it is more
>>> consistent than only taking into account the timer wakeups in the range
>>> above the target residency of the deepest enabled idle state.
>>>
>>> Link: https://www.spec.org/jbb2015/
>>> Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>> ---
>>>  drivers/cpuidle/governors/teo.c |    6 +++---
>>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>>
>>> --- a/drivers/cpuidle/governors/teo.c
>>> +++ b/drivers/cpuidle/governors/teo.c
>>> @@ -349,13 +349,13 @@
>>>       }
>>>
>>>       /*
>>> -      * If the sum of the intercepts metric for all of the idle states
>>> -      * shallower than the current candidate one (idx) is greater than the
>>> +      * If the sum of the intercepts and hits metric for all of the idle
>>> +      * states below the current candidate one (idx) is greater than the
>>>        * sum of the intercepts and hits metrics for the candidate state and
>>>        * all of the deeper states, a shallower idle state is likely to be a
>>>        * better choice.
>>>        */
>>> -     if (2 * idx_intercept_sum > cpu_data->total - idx_hit_sum) {
>>> +     if (2 * (idx_intercept_sum + idx_hit_sum) > cpu_data->total) {
>>>               int first_suitable_idx = idx;
>>>
>>>               /*
>>>
>>>
>>>
>>
>> I'm curious, are Doug's numbers reproducible?
>> Or could you share the idle state usage numbers? Is that explainable?
>> Seems like a lot and it does worry me that I can't reproduce anything
>> as drastic.
> 
> Well, it may not be drastic, but the results below pretty much confirm
> that this particular change isn't going in the right direction IMV.

Agreed. I'd still like to be able to reproduce something like what Doug
reported with my tests too :(

> 
>> I did a bit of x86 as well and got for Raptor Lake (I won't post the
>> non-x86 numbers now, but teo-tweak performs comparable to teo mainline.)
>>
>> Idle 5 min:
>> device   gov     iter    Joules  idles   idle_misses     idle_miss_ratio         belows  aboves
>> teo     0       170.02  12690   646     0.051   371     275
>> teo     1       123.17  8361    517     0.062   281     236
>> teo     2       122.76  7741    347     0.045   262     85
>> teo     3       118.5   8699    668     0.077   307     361
>> teo     4       80.46   8113    443     0.055   264     179
>> teo-tweak       0       115.05  10223   803     0.079   323     480
>> teo-tweak       1       164.41  8523    631     0.074   263     368
>> teo-tweak       2       163.91  8409    711     0.085   256     455
>> teo-tweak       3       137.22  8581    721     0.084   261     460
>> teo-tweak       4       174.95  8703    675     0.078   261     414
> 
> So basically the energy usage goes up, idle misses go up, idle misses
> ratio goes up and the "above" misses go way up.  Not so good as far as
> I'm concerned.
> 
>> teo     0       164.34  8443    516     0.061   303     213
>> teo     1       167.85  8767    492     0.056   256     236
>> teo     2       166.25  7835    406     0.052   263     143
>> teo     3       189.77  8865    493     0.056   276     217
>> teo     4       136.97  9185    467     0.051   286     181
> 
> The above is menu I think?

No, this is teo again; I just wanted to include it because the variance is
quite large (not unusual for idle).
The full table (with mainline menu included):

teo 	0 	170.02 	12690 	646 	0.051 	371 	275
teo 	1 	123.17 	8361 	517 	0.062 	281 	236
teo 	2 	122.76 	7741 	347 	0.045 	262 	85
teo 	3 	118.5 	8699 	668 	0.077 	307 	361
teo 	4 	80.46 	8113 	443 	0.055 	264 	179
teo-tweak 	0 	115.05 	10223 	803 	0.079 	323 	480
teo-tweak 	1 	164.41 	8523 	631 	0.074 	263 	368
teo-tweak 	2 	163.91 	8409 	711 	0.085 	256 	455
teo-tweak 	3 	137.22 	8581 	721 	0.084 	261 	460
teo-tweak 	4 	174.95 	8703 	675 	0.078 	261 	414
teo 	0 	164.34 	8443 	516 	0.061 	303 	213
teo 	1 	167.85 	8767 	492 	0.056 	256 	236
teo 	2 	166.25 	7835 	406 	0.052 	263 	143
teo 	3 	189.77 	8865 	493 	0.056 	276 	217
teo 	4 	136.97 	9185 	467 	0.051 	286 	181
menu 	0 	180.13 	8925 	343 	0.038 	303 	40
menu 	1 	208.49 	8717 	345 	0.040 	312 	33
menu 	2 	168.38 	8451 	321 	0.038 	274 	47
menu 	3 	139.48 	7853 	310 	0.039 	289 	21
menu 	4 	166.61 	7769 	339 	0.044 	322 	17

> 
>> At least in the idle case you can see an increase in 'above' idle_misses.

Agreed, idle_misses clearly go up as mentioned.

>>
>> Firefox Youtube 4K video playback 2 min:
>> device   gov     iter    Joules  idles   idle_misses     idle_miss_ratio         belows  aboves
>> teo     0       260.09  67404   11189   0.166   1899    9290
>> teo     1       273.71  76649   12155   0.159   2233    9922
>> teo     2       231.45  59559   10344   0.174   1747    8597
>> teo     3       202.61  58223   10641   0.183   1748    8893
>> teo     4       217.56  61411   10744   0.175   1809    8935
>> teo-tweak       0       227.99  61209   11251   0.184   2110    9141
>> teo-tweak       1       222.44  61959   10323   0.167   1474    8849
>> teo-tweak       2       218.1   64380   11080   0.172   1845    9235
>> teo-tweak       3       207.4   60183   11267   0.187   1929    9338
>> teo-tweak       4       217.46  61253   10381   0.169   1620    8761
> 
> And it doesn't improve things drastically here, although on average it
> does reduce energy usage.
Patch

--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -349,13 +349,13 @@ 
 	}
 
 	/*
-	 * If the sum of the intercepts metric for all of the idle states
-	 * shallower than the current candidate one (idx) is greater than the
+	 * If the sum of the intercepts and hits metric for all of the idle
+	 * states below the current candidate one (idx) is greater than the
 	 * sum of the intercepts and hits metrics for the candidate state and
 	 * all of the deeper states, a shallower idle state is likely to be a
 	 * better choice.
 	 */
-	if (2 * idx_intercept_sum > cpu_data->total - idx_hit_sum) {
+	if (2 * (idx_intercept_sum + idx_hit_sum) > cpu_data->total) {
 		int first_suitable_idx = idx;
 
 		/*