diff mbox series

[v1,1/4] cpuidle: teo: Add polling flag check to early return path

Message ID 13679187.uLZWGnKmhe@rjwysocki.net
State New
Headers show
Series cpuidle: teo: Fix and cleanups | expand

Commit Message

Rafael J. Wysocki Jan. 10, 2025, 12:53 p.m. UTC
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

After commit 6da8f9ba5a87 ("cpuidle: teo: Skip tick_nohz_get_sleep_length()
call in some cases") the teo governor behaves a bit differently on
systems where idle state 0 is a "polling" state (that is, it is not
really an idle state, but a loop continuously executed by the CPU).
Namely, on such systems it skips the tick_nohz_get_sleep_length() call
if the target residency of the current candidate idle state is small
enough.

However, if state 0 itself was to be returned, it would be returned
right away without calling tick_nohz_get_sleep_length() even on systems
where it was not a "polling" state until commit 4b20b07ce72f ("cpuidle:
teo: Don't count non-existent intercepts") that attempted to fix this
problem.

Unfortunately, commit 4b20b07ce72f has made the governor always call
tick_nohz_get_sleep_length() when about to return state 0 early, even
if that state is a "polling" one, which is inconsistent and defeats
the purpose of commit 6da8f9ba5a87 in that case.

Address this by adding a CPUIDLE_FLAG_POLLING check to the path where
state 0 is returned early to prevent tick_nohz_get_sleep_length() from
being called if it is a "polling" state.

Fixes: 4b20b07ce72f ("cpuidle: teo: Don't count non-existent intercepts")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/cpuidle/governors/teo.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Artem Bityutskiy Jan. 23, 2025, 4:54 p.m. UTC | #1
On Fri, 2025-01-10 at 13:16 +0000, Christian Loehle wrote:
> This would then enable intercept-detection only for <50% of the time,
> another option is to not allow intercepts selecting a polling state, but
> there were recent complaints about this exact behavior from Aboorva (+TO).
> They don't have a low-latency non-polling state.

What folks think about the following idea.

Idle governor algorithm essentially predicts the sleep time (step 1), based on
that, the idle state gets selected (step 2).

What if we add a sleep time factor and expose it via sysfs. The predicted sleep
time would be multiplied by the factor (between steps 1 and 2).

Default factor value is 1 (or 100%). If users want teo be more hesitant
selecting deeper idle states, they can set it to 0.5 (or 50%) or some other
value < 1. I users want teo to be more enthusiastic about selecting deeper idle
states, they set the factor to a value > 1.

We could expose it via sysfs and allow changing in some reasonable range.

The idea is to let users adjust the level of idle governor (teo in this case)
enthusiasm regarding deep C-state.

Thanks,
Artem.
Rafael J. Wysocki Jan. 23, 2025, 6:12 p.m. UTC | #2
On Thu, Jan 23, 2025 at 5:54 PM Artem Bityutskiy <dedekind1@gmail.com> wrote:
>
> On Fri, 2025-01-10 at 13:16 +0000, Christian Loehle wrote:
> > This would then enable intercept-detection only for <50% of the time,
> > another option is to not allow intercepts selecting a polling state, but
> > there were recent complaints about this exact behavior from Aboorva (+TO).
> > They don't have a low-latency non-polling state.
>
> What folks think about the following idea.
>
> Idle governor algorithm essentially predicts the sleep time (step 1), based on
> that, the idle state gets selected (step 2).

No, it's kind of the reverse nowadays.

Step 1 is to preselect an idle state using the idle duration
information from the recent past and step 2 is to refine the selection
using the sleep time information (unless step 1 by itself is
sufficient).

In particular, teo doesn't attempt to predict the idle duration.  It
basically picks up the deepest idle state that wouldn't have been too
deep in the majority of cases taken into account.

> What if we add a sleep time factor and expose it via sysfs. The predicted sleep
> time would be multiplied by the factor (between steps 1 and 2).
>
> Default factor value is 1 (or 100%). If users want teo be more hesitant
> selecting deeper idle states, they can set it to 0.5 (or 50%) or some other
> value < 1. I users want teo to be more enthusiastic about selecting deeper idle
> states, they set the factor to a value > 1.
>
> We could expose it via sysfs and allow changing in some reasonable range.
>
> The idea is to let users adjust the level of idle governor (teo in this case)
> enthusiasm regarding deep C-state.

For teo, that value from user space would effectively work as an upper
boundary for the target residency of idle states to select and there
is not too much difference between this and the existing PM (CPU
latency) QoS (limiting target residency is equivalent to limiting exit
latency).

If you want a "latency bias" knob, for teo it might be something
related to what is regarded as a "sufficient majority".  Right now it
is hardcoded to "over 50%", but it may be a different fraction in
principle ((e.g. 5/8, 2/3 etc.) and it may be tunable within some
limits.
diff mbox series

Patch

--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -422,7 +422,8 @@ 
 			first_suitable_idx = i;
 		}
 	}
-	if (!idx && prev_intercept_idx) {
+	if (!idx && prev_intercept_idx &&
+	    !(drv->states[0].flags & CPUIDLE_FLAG_POLLING)) {
 		/*
 		 * We have to query the sleep length here otherwise we don't
 		 * know after wakeup if our guess was correct.