diff mbox series

[v2] cpufreq: governor: Fix negative 'idle_time' handling in dbs_update()

Message ID 20250212081438.1294503-1-zhanjie9@hisilicon.com
State Superseded
Headers show
Series [v2] cpufreq: governor: Fix negative 'idle_time' handling in dbs_update() | expand

Commit Message

Jie Zhan Feb. 12, 2025, 8:14 a.m. UTC
We observed an issue that the cpu frequency can't raise up with a 100% cpu
load when nohz is off and the 'conservative' governor is selected.

'idle_time' can be negative if it's obtained from get_cpu_idle_time_jiffy()
when nohz is off.  This was found and explained in commit 9485e4ca0b48
("cpufreq: governor: Fix handling of special cases in dbs_update()").

However, commit 7592019634f8 ("cpufreq: governors: Fix long idle detection
logic in load calculation") introduced a comparison between 'idle_time' and
'samling_rate' to detect a long idle interval.  While 'idle_time' is
converted to int before comparison, it's actually promoted to unsigned
again when compared with an unsigned 'sampling_rate'.  Hence, this leads to
wrong idle interval detection when it's in fact 100% busy and sets
policy_dbs->idle_periods to a very large value.  'conservative' adjusts the
frequency to minimum because of the large 'idle_periods', such that the
frequency can't raise up.  'Ondemand' doesn't use policy_dbs->idle_periods
so it fortunately avoids the issue.

Correct negative 'idle_time' to 0 before any use of it in dbs_update().

Fixes: 7592019634f8 ("cpufreq: governors: Fix long idle detection logic in load calculation")
Signed-off-by: Jie Zhan <zhanjie9@hisilicon.com>
---
v2:
- Avoid type conversion, compare current and previous idle time before
  obtaining 'idle_time'.
- Update the explanation in comments.

Discussions:
v1: https://lore.kernel.org/linux-pm/20250210130659.3533182-1-zhanjie9@hisilicon.com/
---
 drivers/cpufreq/cpufreq_governor.c | 42 ++++++++++++++----------------
 1 file changed, 19 insertions(+), 23 deletions(-)

Comments

Rafael J. Wysocki Feb. 12, 2025, 7:49 p.m. UTC | #1
On Wed, Feb 12, 2025 at 9:22 AM Jie Zhan <zhanjie9@hisilicon.com> wrote:
>
> We observed an issue that the cpu frequency can't raise up with a 100% cpu
> load when nohz is off and the 'conservative' governor is selected.

NOHZ

>
> 'idle_time' can be negative if it's obtained from get_cpu_idle_time_jiffy()
> when nohz is off.  This was found and explained in commit 9485e4ca0b48
> ("cpufreq: governor: Fix handling of special cases in dbs_update()").
>
> However, commit 7592019634f8 ("cpufreq: governors: Fix long idle detection
> logic in load calculation") introduced a comparison between 'idle_time' and
> 'samling_rate' to detect a long idle interval.  While 'idle_time' is
> converted to int before comparison, it's actually promoted to unsigned
> again when compared with an unsigned 'sampling_rate'.  Hence, this leads to
> wrong idle interval detection when it's in fact 100% busy and sets
> policy_dbs->idle_periods to a very large value.  'conservative' adjusts the
> frequency to minimum because of the large 'idle_periods', such that the
> frequency can't raise up.  'Ondemand' doesn't use policy_dbs->idle_periods
> so it fortunately avoids the issue.
>
> Correct negative 'idle_time' to 0 before any use of it in dbs_update().
>
> Fixes: 7592019634f8 ("cpufreq: governors: Fix long idle detection logic in load calculation")
> Signed-off-by: Jie Zhan <zhanjie9@hisilicon.com>
> ---
> v2:
> - Avoid type conversion, compare current and previous idle time before
>   obtaining 'idle_time'.
> - Update the explanation in comments.
>
> Discussions:
> v1: https://lore.kernel.org/linux-pm/20250210130659.3533182-1-zhanjie9@hisilicon.com/
> ---
>  drivers/cpufreq/cpufreq_governor.c | 42 ++++++++++++++----------------
>  1 file changed, 19 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
> index af44ee6a6430..c140e3f8d4f9 100644
> --- a/drivers/cpufreq/cpufreq_governor.c
> +++ b/drivers/cpufreq/cpufreq_governor.c
> @@ -145,7 +145,20 @@ unsigned int dbs_update(struct cpufreq_policy *policy)
>                 time_elapsed = update_time - j_cdbs->prev_update_time;
>                 j_cdbs->prev_update_time = update_time;
>
> -               idle_time = cur_idle_time - j_cdbs->prev_cpu_idle;
> +               /*
> +                * cur_idle_time could be smaller than j_cdbs->prev_cpu_idle if
> +                * it's obtained from get_cpu_idle_time_jiffy() when NO_HZ is
> +                * off, where idle_time is calculated by the difference between
> +                * time elapsed in jiffies and "busy time" obtained from CPU
> +                * statistics.  If a CPU is 100% busy, the time elapsed and busy
> +                * time should grow with the same amount in two consecutive
> +                * samples, but in practice there could be a tiny difference,
> +                * making the accumulated idle time decrease sometimes.  Hence,
> +                * in this case, idle_time should be regarded as 0 in order to
> +                * make the further process correct.
> +                */
> +               idle_time = cur_idle_time > j_cdbs->prev_cpu_idle ?
> +                           cur_idle_time - j_cdbs->prev_cpu_idle : 0;

Please avoid using the ternary operator in new code.

>                 j_cdbs->prev_cpu_idle = cur_idle_time;
>
>                 if (ignore_nice) {
> @@ -162,7 +175,7 @@ unsigned int dbs_update(struct cpufreq_policy *policy)
>                          * calls, so the previous load value can be used then.
>                          */
>                         load = j_cdbs->prev_load;
> -               } else if (unlikely((int)idle_time > 2 * sampling_rate &&
> +               } else if (unlikely(idle_time > 2 * sampling_rate &&
>                                     j_cdbs->prev_load)) {
>                         /*
>                          * If the CPU had gone completely idle and a task has
> @@ -189,30 +202,13 @@ unsigned int dbs_update(struct cpufreq_policy *policy)
>                         load = j_cdbs->prev_load;
>                         j_cdbs->prev_load = 0;
>                 } else {
> -                       if (time_elapsed >= idle_time) {
> -                               load = 100 * (time_elapsed - idle_time) / time_elapsed;
> -                       } else {
> -                               /*
> -                                * That can happen if idle_time is returned by
> -                                * get_cpu_idle_time_jiffy().  In that case
> -                                * idle_time is roughly equal to the difference
> -                                * between time_elapsed and "busy time" obtained
> -                                * from CPU statistics.  Then, the "busy time"
> -                                * can end up being greater than time_elapsed
> -                                * (for example, if jiffies_64 and the CPU
> -                                * statistics are updated by different CPUs),
> -                                * so idle_time may in fact be negative.  That
> -                                * means, though, that the CPU was busy all
> -                                * the time (on the rough average) during the
> -                                * last sampling interval and 100 can be
> -                                * returned as the load.
> -                                */
> -                               load = (int)idle_time < 0 ? 100 : 0;
> -                       }
> +                       load = time_elapsed > idle_time ?
> +                              100 * (time_elapsed - idle_time) / time_elapsed :
> +                              0;
>                         j_cdbs->prev_load = load;
>                 }
>
> -               if (unlikely((int)idle_time > 2 * sampling_rate)) {
> +               if (unlikely(idle_time > 2 * sampling_rate)) {
>                         unsigned int periods = idle_time / sampling_rate;
>
>                         if (periods < idle_periods)
> --
Jie Zhan Feb. 13, 2025, 1:20 a.m. UTC | #2
On 13/02/2025 03:49, Rafael J. Wysocki wrote:
> On Wed, Feb 12, 2025 at 9:22 AM Jie Zhan <zhanjie9@hisilicon.com> wrote:
>>
>> We observed an issue that the cpu frequency can't raise up with a 100% cpu
>> load when nohz is off and the 'conservative' governor is selected.
> 
> NOHZ

Cool.

> 
>>
>> 'idle_time' can be negative if it's obtained from get_cpu_idle_time_jiffy()
>> when nohz is off.  This was found and explained in commit 9485e4ca0b48
>> ("cpufreq: governor: Fix handling of special cases in dbs_update()").
>>
>> However, commit 7592019634f8 ("cpufreq: governors: Fix long idle detection
>> logic in load calculation") introduced a comparison between 'idle_time' and
>> 'samling_rate' to detect a long idle interval.  While 'idle_time' is
>> converted to int before comparison, it's actually promoted to unsigned
>> again when compared with an unsigned 'sampling_rate'.  Hence, this leads to
>> wrong idle interval detection when it's in fact 100% busy and sets
>> policy_dbs->idle_periods to a very large value.  'conservative' adjusts the
>> frequency to minimum because of the large 'idle_periods', such that the
>> frequency can't raise up.  'Ondemand' doesn't use policy_dbs->idle_periods
>> so it fortunately avoids the issue.
>>
>> Correct negative 'idle_time' to 0 before any use of it in dbs_update().
>>
>> Fixes: 7592019634f8 ("cpufreq: governors: Fix long idle detection logic in load calculation")
>> Signed-off-by: Jie Zhan <zhanjie9@hisilicon.com>
>> ---
>> v2:
>> - Avoid type conversion, compare current and previous idle time before
>>   obtaining 'idle_time'.
>> - Update the explanation in comments.
>>
>> Discussions:
>> v1: https://lore.kernel.org/linux-pm/20250210130659.3533182-1-zhanjie9@hisilicon.com/
>> ---
>>  drivers/cpufreq/cpufreq_governor.c | 42 ++++++++++++++----------------
>>  1 file changed, 19 insertions(+), 23 deletions(-)
>>
>> diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
>> index af44ee6a6430..c140e3f8d4f9 100644
>> --- a/drivers/cpufreq/cpufreq_governor.c
>> +++ b/drivers/cpufreq/cpufreq_governor.c
>> @@ -145,7 +145,20 @@ unsigned int dbs_update(struct cpufreq_policy *policy)
>>                 time_elapsed = update_time - j_cdbs->prev_update_time;
>>                 j_cdbs->prev_update_time = update_time;
>>
>> -               idle_time = cur_idle_time - j_cdbs->prev_cpu_idle;
>> +               /*
>> +                * cur_idle_time could be smaller than j_cdbs->prev_cpu_idle if
>> +                * it's obtained from get_cpu_idle_time_jiffy() when NO_HZ is
>> +                * off, where idle_time is calculated by the difference between
>> +                * time elapsed in jiffies and "busy time" obtained from CPU
>> +                * statistics.  If a CPU is 100% busy, the time elapsed and busy
>> +                * time should grow with the same amount in two consecutive
>> +                * samples, but in practice there could be a tiny difference,
>> +                * making the accumulated idle time decrease sometimes.  Hence,
>> +                * in this case, idle_time should be regarded as 0 in order to
>> +                * make the further process correct.
>> +                */
>> +               idle_time = cur_idle_time > j_cdbs->prev_cpu_idle ?
>> +                           cur_idle_time - j_cdbs->prev_cpu_idle : 0;
> 
> Please avoid using the ternary operator in new code.

Sure. Will fix this here and below. 

> 
>>                 j_cdbs->prev_cpu_idle = cur_idle_time;
>>
>>                 if (ignore_nice) {
>> @@ -162,7 +175,7 @@ unsigned int dbs_update(struct cpufreq_policy *policy)
>>                          * calls, so the previous load value can be used then.
>>                          */
>>                         load = j_cdbs->prev_load;
>> -               } else if (unlikely((int)idle_time > 2 * sampling_rate &&
>> +               } else if (unlikely(idle_time > 2 * sampling_rate &&
>>                                     j_cdbs->prev_load)) {
>>                         /*
>>                          * If the CPU had gone completely idle and a task has
>> @@ -189,30 +202,13 @@ unsigned int dbs_update(struct cpufreq_policy *policy)
>>                         load = j_cdbs->prev_load;
>>                         j_cdbs->prev_load = 0;
>>                 } else {
>> -                       if (time_elapsed >= idle_time) {
>> -                               load = 100 * (time_elapsed - idle_time) / time_elapsed;
>> -                       } else {
>> -                               /*
>> -                                * That can happen if idle_time is returned by
>> -                                * get_cpu_idle_time_jiffy().  In that case
>> -                                * idle_time is roughly equal to the difference
>> -                                * between time_elapsed and "busy time" obtained
>> -                                * from CPU statistics.  Then, the "busy time"
>> -                                * can end up being greater than time_elapsed
>> -                                * (for example, if jiffies_64 and the CPU
>> -                                * statistics are updated by different CPUs),
>> -                                * so idle_time may in fact be negative.  That
>> -                                * means, though, that the CPU was busy all
>> -                                * the time (on the rough average) during the
>> -                                * last sampling interval and 100 can be
>> -                                * returned as the load.
>> -                                */
>> -                               load = (int)idle_time < 0 ? 100 : 0;
>> -                       }
>> +                       load = time_elapsed > idle_time ?
>> +                              100 * (time_elapsed - idle_time) / time_elapsed :
>> +                              0;
>>                         j_cdbs->prev_load = load;
>>                 }
>>
>> -               if (unlikely((int)idle_time > 2 * sampling_rate)) {
>> +               if (unlikely(idle_time > 2 * sampling_rate)) {
>>                         unsigned int periods = idle_time / sampling_rate;
>>
>>                         if (periods < idle_periods)
>> --
diff mbox series

Patch

diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
index af44ee6a6430..c140e3f8d4f9 100644
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -145,7 +145,20 @@  unsigned int dbs_update(struct cpufreq_policy *policy)
 		time_elapsed = update_time - j_cdbs->prev_update_time;
 		j_cdbs->prev_update_time = update_time;
 
-		idle_time = cur_idle_time - j_cdbs->prev_cpu_idle;
+		/*
+		 * cur_idle_time could be smaller than j_cdbs->prev_cpu_idle if
+		 * it's obtained from get_cpu_idle_time_jiffy() when NO_HZ is
+		 * off, where idle_time is calculated by the difference between
+		 * time elapsed in jiffies and "busy time" obtained from CPU
+		 * statistics.  If a CPU is 100% busy, the time elapsed and busy
+		 * time should grow with the same amount in two consecutive
+		 * samples, but in practice there could be a tiny difference,
+		 * making the accumulated idle time decrease sometimes.  Hence,
+		 * in this case, idle_time should be regarded as 0 in order to
+		 * make the further process correct.
+		 */
+		idle_time = cur_idle_time > j_cdbs->prev_cpu_idle ?
+			    cur_idle_time - j_cdbs->prev_cpu_idle : 0;
 		j_cdbs->prev_cpu_idle = cur_idle_time;
 
 		if (ignore_nice) {
@@ -162,7 +175,7 @@  unsigned int dbs_update(struct cpufreq_policy *policy)
 			 * calls, so the previous load value can be used then.
 			 */
 			load = j_cdbs->prev_load;
-		} else if (unlikely((int)idle_time > 2 * sampling_rate &&
+		} else if (unlikely(idle_time > 2 * sampling_rate &&
 				    j_cdbs->prev_load)) {
 			/*
 			 * If the CPU had gone completely idle and a task has
@@ -189,30 +202,13 @@  unsigned int dbs_update(struct cpufreq_policy *policy)
 			load = j_cdbs->prev_load;
 			j_cdbs->prev_load = 0;
 		} else {
-			if (time_elapsed >= idle_time) {
-				load = 100 * (time_elapsed - idle_time) / time_elapsed;
-			} else {
-				/*
-				 * That can happen if idle_time is returned by
-				 * get_cpu_idle_time_jiffy().  In that case
-				 * idle_time is roughly equal to the difference
-				 * between time_elapsed and "busy time" obtained
-				 * from CPU statistics.  Then, the "busy time"
-				 * can end up being greater than time_elapsed
-				 * (for example, if jiffies_64 and the CPU
-				 * statistics are updated by different CPUs),
-				 * so idle_time may in fact be negative.  That
-				 * means, though, that the CPU was busy all
-				 * the time (on the rough average) during the
-				 * last sampling interval and 100 can be
-				 * returned as the load.
-				 */
-				load = (int)idle_time < 0 ? 100 : 0;
-			}
+			load = time_elapsed > idle_time ?
+			       100 * (time_elapsed - idle_time) / time_elapsed :
+			       0;
 			j_cdbs->prev_load = load;
 		}
 
-		if (unlikely((int)idle_time > 2 * sampling_rate)) {
+		if (unlikely(idle_time > 2 * sampling_rate)) {
 			unsigned int periods = idle_time / sampling_rate;
 
 			if (periods < idle_periods)