diff mbox series

[v2] ACPI: Replace msleep() with usleep_range() in acpi_os_sleep().

Message ID c7db7e804c453629c116d508558eaf46477a2d73.1731708405.git.len.brown@intel.com
State New
Headers show
Series [v2] ACPI: Replace msleep() with usleep_range() in acpi_os_sleep(). | expand

Commit Message

Len Brown Nov. 15, 2024, 11:11 p.m. UTC
From: Len Brown <len.brown@intel.com>

Replace msleep() with usleep_range() in acpi_os_sleep().

This has a significant user-visible performance benefit
on some ACPI flows on some systems.  eg. Kernel resume
time of a Dell XPS-13-9300 drops from 1943ms to 1127ms (42%).

usleep_range(min, min) is used because there is scant
opportunity for timer coalescing during ACPI flows
related to system suspend, resume (or initialization).

ie. During these flows usleep_range(min, max) is observed to
be effectvely be the same as usleep_range(max, max).

Similarly, msleep() for long sleeps is not considered because
these flows almost never have opportunities to coalesce
with other activity on jiffie boundaries, leaving no
measurably benefit to rounding up to jiffie boundaries.

Background:

acpi_os_sleep() supports the ACPI AML Sleep(msec) operator,
and it must not return before the requested number of msec.

Until Linux-3.13, this contract was sometimes violated by using
schedule_timeout_interruptible(j), which could return early.

Since Linux-3.13, acpi_os_sleep() uses msleep(),
which doesn't return early, but is still subject
to long delays due to the low resolution of the jiffie clock.

Linux-6.12 removed a stray jiffie from msleep: commit 4381b895f544
("timers: Remove historical extra jiffie for timeout in msleep()")
The 4ms savings is material for some durations,
but msleep is still generally too course. eg msleep(5)
on a 250HZ system still takes 11.9ms.

System resume performance of a Dell XPS 13 9300:

Linux-6.11:
msleep HZ 250	2460 ms

Linux-6.12:
msleep HZ 250	1943 ms
msleep HZ 1000	1233 ms
usleep HZ 250	1127 ms
usleep HZ 1000	1130 ms

Closes: https://bugzilla.kernel.org/show_bug.cgi?id=216263
Signed-off-by: Len Brown <len.brown@intel.com>
Suggested-by: Arjan van de Ven <arjan@linux.intel.com>
Tested-by: Todd Brandt <todd.e.brandt@intel.com>
---
 drivers/acpi/osl.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Rafael J. Wysocki Nov. 18, 2024, 11:03 a.m. UTC | #1
On Sat, Nov 16, 2024 at 12:11 AM Len Brown <lenb@kernel.org> wrote:
>
> From: Len Brown <len.brown@intel.com>
>
> Replace msleep() with usleep_range() in acpi_os_sleep().
>
> This has a significant user-visible performance benefit
> on some ACPI flows on some systems.  eg. Kernel resume
> time of a Dell XPS-13-9300 drops from 1943ms to 1127ms (42%).

Sure.

And the argument seems to be that it is better to always use more
resources in a given path (ACPI sleep in this particular case) than to
be somewhat inaccurate which is visible in some cases.

This would mean that hrtimers should always be used everywhere, but they aren't.

While I have nothing against addressing the short sleeps issue where
the msleep() inaccuracy is too large, I don't see why this requires
using a hrtimer with no slack in all cases.

The argument seems to be that the short sleeps case is hard to
distinguish from the other cases, but I'm not sure about this.

Also, something like this might work, but for some reason you don't
want to do it:

if (ms >= 12 * MSEC_PER_SEC / HZ) {
        msleep(ms);
} else {
       u64 us = ms * USEC_PER_MSEC;

      usleep_range(us, us / 8);
}

> usleep_range(min, min) is used because there is scant
> opportunity for timer coalescing during ACPI flows
> related to system suspend, resume (or initialization).
>
> ie. During these flows usleep_range(min, max) is observed to
> be effectvely be the same as usleep_range(max, max).
>
> Similarly, msleep() for long sleeps is not considered because
> these flows almost never have opportunities to coalesce
> with other activity on jiffie boundaries, leaving no
> measurably benefit to rounding up to jiffie boundaries.
>
> Background:
>
> acpi_os_sleep() supports the ACPI AML Sleep(msec) operator,
> and it must not return before the requested number of msec.
>
> Until Linux-3.13, this contract was sometimes violated by using
> schedule_timeout_interruptible(j), which could return early.
>
> Since Linux-3.13, acpi_os_sleep() uses msleep(),
> which doesn't return early, but is still subject
> to long delays due to the low resolution of the jiffie clock.
>
> Linux-6.12 removed a stray jiffie from msleep: commit 4381b895f544
> ("timers: Remove historical extra jiffie for timeout in msleep()")
> The 4ms savings is material for some durations,
> but msleep is still generally too course. eg msleep(5)
> on a 250HZ system still takes 11.9ms.
>
> System resume performance of a Dell XPS 13 9300:
>
> Linux-6.11:
> msleep HZ 250   2460 ms
>
> Linux-6.12:
> msleep HZ 250   1943 ms
> msleep HZ 1000  1233 ms
> usleep HZ 250   1127 ms
> usleep HZ 1000  1130 ms
>
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=216263
> Signed-off-by: Len Brown <len.brown@intel.com>
> Suggested-by: Arjan van de Ven <arjan@linux.intel.com>
> Tested-by: Todd Brandt <todd.e.brandt@intel.com>
> ---
>  drivers/acpi/osl.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
> index 70af3fbbebe5..daf87e33b8ea 100644
> --- a/drivers/acpi/osl.c
> +++ b/drivers/acpi/osl.c
> @@ -607,7 +607,9 @@ acpi_status acpi_os_remove_interrupt_handler(u32 gsi, acpi_osd_handler handler)
>
>  void acpi_os_sleep(u64 ms)
>  {
> -       msleep(ms);
> +       u64 us = ms * USEC_PER_MSEC;
> +
> +       usleep_range(us, us);
>  }
>
>  void acpi_os_stall(u32 us)
> --
> 2.43.0
>
Hans de Goede Nov. 18, 2024, 11:38 a.m. UTC | #2
Hi Rafael, Len,

On 18-Nov-24 12:03 PM, Rafael J. Wysocki wrote:
> On Sat, Nov 16, 2024 at 12:11 AM Len Brown <lenb@kernel.org> wrote:
>>
>> From: Len Brown <len.brown@intel.com>
>>
>> Replace msleep() with usleep_range() in acpi_os_sleep().
>>
>> This has a significant user-visible performance benefit
>> on some ACPI flows on some systems.  eg. Kernel resume
>> time of a Dell XPS-13-9300 drops from 1943ms to 1127ms (42%).
> 
> Sure.
> 
> And the argument seems to be that it is better to always use more
> resources in a given path (ACPI sleep in this particular case) than to
> be somewhat inaccurate which is visible in some cases.
> 
> This would mean that hrtimers should always be used everywhere, but they aren't.
> 
> While I have nothing against addressing the short sleeps issue where
> the msleep() inaccuracy is too large, I don't see why this requires
> using a hrtimer with no slack in all cases.
> 
> The argument seems to be that the short sleeps case is hard to
> distinguish from the other cases, but I'm not sure about this.
> 
> Also, something like this might work, but for some reason you don't
> want to do it:
> 
> if (ms >= 12 * MSEC_PER_SEC / HZ) {
>         msleep(ms);
> } else {
>        u64 us = ms * USEC_PER_MSEC;
> 
>       usleep_range(us, us / 8);
> }

FWIW I was thinking the same thing, that it would be good to still
use msleep when the sleep is > (MSEC_PER_SEC / HZ), not sure
why you added the 12 there ? Surely something like a sleep longer
then 3 timerticks (I know we have NOHZ but still) would already be
long enough to not worry about msleep slack ?

And I assume the usleep_range(us, us / 8); is a typo ? Ma can
never be less then max, maybe you meant: usleep_range(us, us + 8) ?

OTOH it is not like we will hit these ACPI acpi_os_sleep()
calls multiple times per second all the time. On a normal idle
system I expect there to not be that many calls (could still
be a few from ACPI managed devices going into + out of
runtime-pm regularly). And if don't hit acpi_os_sleep() calls
multiple times per second then the chances of time coalescing
are not that big anyways.

Still I think that finding something middle ground between always
sleeping the exact min time and the old msleep() call, as Rafael
is proposing, would be good IMHO.

Regards,

Hans




> 
>> usleep_range(min, min) is used because there is scant
>> opportunity for timer coalescing during ACPI flows
>> related to system suspend, resume (or initialization).
>>
>> ie. During these flows usleep_range(min, max) is observed to
>> be effectvely be the same as usleep_range(max, max).
>>
>> Similarly, msleep() for long sleeps is not considered because
>> these flows almost never have opportunities to coalesce
>> with other activity on jiffie boundaries, leaving no
>> measurably benefit to rounding up to jiffie boundaries.
>>
>> Background:
>>
>> acpi_os_sleep() supports the ACPI AML Sleep(msec) operator,
>> and it must not return before the requested number of msec.
>>
>> Until Linux-3.13, this contract was sometimes violated by using
>> schedule_timeout_interruptible(j), which could return early.
>>
>> Since Linux-3.13, acpi_os_sleep() uses msleep(),
>> which doesn't return early, but is still subject
>> to long delays due to the low resolution of the jiffie clock.
>>
>> Linux-6.12 removed a stray jiffie from msleep: commit 4381b895f544
>> ("timers: Remove historical extra jiffie for timeout in msleep()")
>> The 4ms savings is material for some durations,
>> but msleep is still generally too course. eg msleep(5)
>> on a 250HZ system still takes 11.9ms.
>>
>> System resume performance of a Dell XPS 13 9300:
>>
>> Linux-6.11:
>> msleep HZ 250   2460 ms
>>
>> Linux-6.12:
>> msleep HZ 250   1943 ms
>> msleep HZ 1000  1233 ms
>> usleep HZ 250   1127 ms
>> usleep HZ 1000  1130 ms
>>
>> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=216263
>> Signed-off-by: Len Brown <len.brown@intel.com>
>> Suggested-by: Arjan van de Ven <arjan@linux.intel.com>
>> Tested-by: Todd Brandt <todd.e.brandt@intel.com>
>> ---
>>  drivers/acpi/osl.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
>> index 70af3fbbebe5..daf87e33b8ea 100644
>> --- a/drivers/acpi/osl.c
>> +++ b/drivers/acpi/osl.c
>> @@ -607,7 +607,9 @@ acpi_status acpi_os_remove_interrupt_handler(u32 gsi, acpi_osd_handler handler)
>>
>>  void acpi_os_sleep(u64 ms)
>>  {
>> -       msleep(ms);
>> +       u64 us = ms * USEC_PER_MSEC;
>> +
>> +       usleep_range(us, us);
>>  }
>>
>>  void acpi_os_stall(u32 us)
>> --
>> 2.43.0
>>
>
Rafael J. Wysocki Nov. 18, 2024, 12:02 p.m. UTC | #3
Hi Hans,

On Mon, Nov 18, 2024 at 12:38 PM Hans de Goede <hdegoede@redhat.com> wrote:
>
> Hi Rafael, Len,
>
> On 18-Nov-24 12:03 PM, Rafael J. Wysocki wrote:
> > On Sat, Nov 16, 2024 at 12:11 AM Len Brown <lenb@kernel.org> wrote:
> >>
> >> From: Len Brown <len.brown@intel.com>
> >>
> >> Replace msleep() with usleep_range() in acpi_os_sleep().
> >>
> >> This has a significant user-visible performance benefit
> >> on some ACPI flows on some systems.  eg. Kernel resume
> >> time of a Dell XPS-13-9300 drops from 1943ms to 1127ms (42%).
> >
> > Sure.
> >
> > And the argument seems to be that it is better to always use more
> > resources in a given path (ACPI sleep in this particular case) than to
> > be somewhat inaccurate which is visible in some cases.
> >
> > This would mean that hrtimers should always be used everywhere, but they aren't.
> >
> > While I have nothing against addressing the short sleeps issue where
> > the msleep() inaccuracy is too large, I don't see why this requires
> > using a hrtimer with no slack in all cases.
> >
> > The argument seems to be that the short sleeps case is hard to
> > distinguish from the other cases, but I'm not sure about this.
> >
> > Also, something like this might work, but for some reason you don't
> > want to do it:
> >
> > if (ms >= 12 * MSEC_PER_SEC / HZ) {
> >         msleep(ms);
> > } else {
> >        u64 us = ms * USEC_PER_MSEC;
> >
> >       usleep_range(us, us / 8);

Should be

      usleep_range(us, us + us / 8);

(I notoriously confuse this API).

> > }
>
> FWIW I was thinking the same thing, that it would be good to still
> use msleep when the sleep is > (MSEC_PER_SEC / HZ), not sure
> why you added the 12 there ? Surely something like a sleep longer
> then 3 timerticks (I know we have NOHZ but still) would already be
> long enough to not worry about msleep slack ?

The typical msleep() overhead in 6.12 appears to be 1.5 jiffy which is
1.5 * MSEC_PER_SEC / HZ and I want the usleep() delta to be less than
this, so

delta = ms / 8 <= 1.5 * MSEC_PER_SEC / HZ

> And I assume the usleep_range(us, us / 8); is a typo ? Ma can
> never be less then max, maybe you meant: usleep_range(us, us + 8) ?

No, please see above.

> OTOH it is not like we will hit these ACPI acpi_os_sleep()
> calls multiple times per second all the time. On a normal idle
> system I expect there to not be that many calls (could still
> be a few from ACPI managed devices going into + out of
> runtime-pm regularly). And if don't hit acpi_os_sleep() calls
> multiple times per second then the chances of time coalescing
> are not that big anyways.
>
> Still I think that finding something middle ground between always
> sleeping the exact min time and the old msleep() call, as Rafael
> is proposing, would be good IMHO.

Thanks for the feedback!


> >> usleep_range(min, min) is used because there is scant
> >> opportunity for timer coalescing during ACPI flows
> >> related to system suspend, resume (or initialization).
> >>
> >> ie. During these flows usleep_range(min, max) is observed to
> >> be effectvely be the same as usleep_range(max, max).
> >>
> >> Similarly, msleep() for long sleeps is not considered because
> >> these flows almost never have opportunities to coalesce
> >> with other activity on jiffie boundaries, leaving no
> >> measurably benefit to rounding up to jiffie boundaries.
> >>
> >> Background:
> >>
> >> acpi_os_sleep() supports the ACPI AML Sleep(msec) operator,
> >> and it must not return before the requested number of msec.
> >>
> >> Until Linux-3.13, this contract was sometimes violated by using
> >> schedule_timeout_interruptible(j), which could return early.
> >>
> >> Since Linux-3.13, acpi_os_sleep() uses msleep(),
> >> which doesn't return early, but is still subject
> >> to long delays due to the low resolution of the jiffie clock.
> >>
> >> Linux-6.12 removed a stray jiffie from msleep: commit 4381b895f544
> >> ("timers: Remove historical extra jiffie for timeout in msleep()")
> >> The 4ms savings is material for some durations,
> >> but msleep is still generally too course. eg msleep(5)
> >> on a 250HZ system still takes 11.9ms.
> >>
> >> System resume performance of a Dell XPS 13 9300:
> >>
> >> Linux-6.11:
> >> msleep HZ 250   2460 ms
> >>
> >> Linux-6.12:
> >> msleep HZ 250   1943 ms
> >> msleep HZ 1000  1233 ms
> >> usleep HZ 250   1127 ms
> >> usleep HZ 1000  1130 ms
> >>
> >> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=216263
> >> Signed-off-by: Len Brown <len.brown@intel.com>
> >> Suggested-by: Arjan van de Ven <arjan@linux.intel.com>
> >> Tested-by: Todd Brandt <todd.e.brandt@intel.com>
> >> ---
> >>  drivers/acpi/osl.c | 4 +++-
> >>  1 file changed, 3 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
> >> index 70af3fbbebe5..daf87e33b8ea 100644
> >> --- a/drivers/acpi/osl.c
> >> +++ b/drivers/acpi/osl.c
> >> @@ -607,7 +607,9 @@ acpi_status acpi_os_remove_interrupt_handler(u32 gsi, acpi_osd_handler handler)
> >>
> >>  void acpi_os_sleep(u64 ms)
> >>  {
> >> -       msleep(ms);
> >> +       u64 us = ms * USEC_PER_MSEC;
> >> +
> >> +       usleep_range(us, us);
> >>  }
> >>
> >>  void acpi_os_stall(u32 us)
> >> --
> >> 2.43.0
> >>
> >
>
Hans de Goede Nov. 18, 2024, 12:10 p.m. UTC | #4
Hi Rafael,

On 18-Nov-24 1:02 PM, Rafael J. Wysocki wrote:
> Hi Hans,
> 
> On Mon, Nov 18, 2024 at 12:38 PM Hans de Goede <hdegoede@redhat.com> wrote:
>>
>> Hi Rafael, Len,
>>
>> On 18-Nov-24 12:03 PM, Rafael J. Wysocki wrote:
>>> On Sat, Nov 16, 2024 at 12:11 AM Len Brown <lenb@kernel.org> wrote:
>>>>
>>>> From: Len Brown <len.brown@intel.com>
>>>>
>>>> Replace msleep() with usleep_range() in acpi_os_sleep().
>>>>
>>>> This has a significant user-visible performance benefit
>>>> on some ACPI flows on some systems.  eg. Kernel resume
>>>> time of a Dell XPS-13-9300 drops from 1943ms to 1127ms (42%).
>>>
>>> Sure.
>>>
>>> And the argument seems to be that it is better to always use more
>>> resources in a given path (ACPI sleep in this particular case) than to
>>> be somewhat inaccurate which is visible in some cases.
>>>
>>> This would mean that hrtimers should always be used everywhere, but they aren't.
>>>
>>> While I have nothing against addressing the short sleeps issue where
>>> the msleep() inaccuracy is too large, I don't see why this requires
>>> using a hrtimer with no slack in all cases.
>>>
>>> The argument seems to be that the short sleeps case is hard to
>>> distinguish from the other cases, but I'm not sure about this.
>>>
>>> Also, something like this might work, but for some reason you don't
>>> want to do it:
>>>
>>> if (ms >= 12 * MSEC_PER_SEC / HZ) {
>>>         msleep(ms);
>>> } else {
>>>        u64 us = ms * USEC_PER_MSEC;
>>>
>>>       usleep_range(us, us / 8);
> 
> Should be
> 
>       usleep_range(us, us + us / 8);
> 
> (I notoriously confuse this API).

I see.

>>> }
>>
>> FWIW I was thinking the same thing, that it would be good to still
>> use msleep when the sleep is > (MSEC_PER_SEC / HZ), not sure
>> why you added the 12 there ? Surely something like a sleep longer
>> then 3 timerticks (I know we have NOHZ but still) would already be
>> long enough to not worry about msleep slack ?
> 
> The typical msleep() overhead in 6.12 appears to be 1.5 jiffy which is
> 1.5 * MSEC_PER_SEC / HZ and I want the usleep() delta to be less than
> this, so
> 
> delta = ms / 8 <= 1.5 * MSEC_PER_SEC / HZ

Ok, that makes sense. But this probably requires a comment explaining
this so that when someone looks at this in the future they understand
where the 12 comes from.

Where as the / 8 is just a choice right? I think it is decent choice,
but still this is just a value you picked which should work nicely,
right ?

>> OTOH it is not like we will hit these ACPI acpi_os_sleep()
>> calls multiple times per second all the time. On a normal idle
>> system I expect there to not be that many calls (could still
>> be a few from ACPI managed devices going into + out of
>> runtime-pm regularly). And if don't hit acpi_os_sleep() calls
>> multiple times per second then the chances of time coalescing
>> are not that big anyways.
>>
>> Still I think that finding something middle ground between always
>> sleeping the exact min time and the old msleep() call, as Rafael
>> is proposing, would be good IMHO.
> 
> Thanks for the feedback!

You're welcome.

Len any chance you can give Rafael's proposal a test run on the
same Dell XPS 13 9300 and see what this means for the resume time ?

If this gets close enough to your patch I think we should go with
what Rafael is proposing.

Regards,

Hans




>>>> usleep_range(min, min) is used because there is scant
>>>> opportunity for timer coalescing during ACPI flows
>>>> related to system suspend, resume (or initialization).
>>>>
>>>> ie. During these flows usleep_range(min, max) is observed to
>>>> be effectvely be the same as usleep_range(max, max).
>>>>
>>>> Similarly, msleep() for long sleeps is not considered because
>>>> these flows almost never have opportunities to coalesce
>>>> with other activity on jiffie boundaries, leaving no
>>>> measurably benefit to rounding up to jiffie boundaries.
>>>>
>>>> Background:
>>>>
>>>> acpi_os_sleep() supports the ACPI AML Sleep(msec) operator,
>>>> and it must not return before the requested number of msec.
>>>>
>>>> Until Linux-3.13, this contract was sometimes violated by using
>>>> schedule_timeout_interruptible(j), which could return early.
>>>>
>>>> Since Linux-3.13, acpi_os_sleep() uses msleep(),
>>>> which doesn't return early, but is still subject
>>>> to long delays due to the low resolution of the jiffie clock.
>>>>
>>>> Linux-6.12 removed a stray jiffie from msleep: commit 4381b895f544
>>>> ("timers: Remove historical extra jiffie for timeout in msleep()")
>>>> The 4ms savings is material for some durations,
>>>> but msleep is still generally too course. eg msleep(5)
>>>> on a 250HZ system still takes 11.9ms.
>>>>
>>>> System resume performance of a Dell XPS 13 9300:
>>>>
>>>> Linux-6.11:
>>>> msleep HZ 250   2460 ms
>>>>
>>>> Linux-6.12:
>>>> msleep HZ 250   1943 ms
>>>> msleep HZ 1000  1233 ms
>>>> usleep HZ 250   1127 ms
>>>> usleep HZ 1000  1130 ms
>>>>
>>>> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=216263
>>>> Signed-off-by: Len Brown <len.brown@intel.com>
>>>> Suggested-by: Arjan van de Ven <arjan@linux.intel.com>
>>>> Tested-by: Todd Brandt <todd.e.brandt@intel.com>
>>>> ---
>>>>  drivers/acpi/osl.c | 4 +++-
>>>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
>>>> index 70af3fbbebe5..daf87e33b8ea 100644
>>>> --- a/drivers/acpi/osl.c
>>>> +++ b/drivers/acpi/osl.c
>>>> @@ -607,7 +607,9 @@ acpi_status acpi_os_remove_interrupt_handler(u32 gsi, acpi_osd_handler handler)
>>>>
>>>>  void acpi_os_sleep(u64 ms)
>>>>  {
>>>> -       msleep(ms);
>>>> +       u64 us = ms * USEC_PER_MSEC;
>>>> +
>>>> +       usleep_range(us, us);
>>>>  }
>>>>
>>>>  void acpi_os_stall(u32 us)
>>>> --
>>>> 2.43.0
>>>>
>>>
>>
>
Rafael J. Wysocki Nov. 18, 2024, 12:22 p.m. UTC | #5
Hi Hans,

On Mon, Nov 18, 2024 at 1:10 PM Hans de Goede <hdegoede@redhat.com> wrote:
>
> Hi Rafael,
>
> On 18-Nov-24 1:02 PM, Rafael J. Wysocki wrote:
> > Hi Hans,
> >
> > On Mon, Nov 18, 2024 at 12:38 PM Hans de Goede <hdegoede@redhat.com> wrote:
> >>
> >> Hi Rafael, Len,
> >>
> >> On 18-Nov-24 12:03 PM, Rafael J. Wysocki wrote:
> >>> On Sat, Nov 16, 2024 at 12:11 AM Len Brown <lenb@kernel.org> wrote:
> >>>>
> >>>> From: Len Brown <len.brown@intel.com>
> >>>>
> >>>> Replace msleep() with usleep_range() in acpi_os_sleep().
> >>>>
> >>>> This has a significant user-visible performance benefit
> >>>> on some ACPI flows on some systems.  eg. Kernel resume
> >>>> time of a Dell XPS-13-9300 drops from 1943ms to 1127ms (42%).
> >>>
> >>> Sure.
> >>>
> >>> And the argument seems to be that it is better to always use more
> >>> resources in a given path (ACPI sleep in this particular case) than to
> >>> be somewhat inaccurate which is visible in some cases.
> >>>
> >>> This would mean that hrtimers should always be used everywhere, but they aren't.
> >>>
> >>> While I have nothing against addressing the short sleeps issue where
> >>> the msleep() inaccuracy is too large, I don't see why this requires
> >>> using a hrtimer with no slack in all cases.
> >>>
> >>> The argument seems to be that the short sleeps case is hard to
> >>> distinguish from the other cases, but I'm not sure about this.
> >>>
> >>> Also, something like this might work, but for some reason you don't
> >>> want to do it:
> >>>
> >>> if (ms >= 12 * MSEC_PER_SEC / HZ) {
> >>>         msleep(ms);
> >>> } else {
> >>>        u64 us = ms * USEC_PER_MSEC;
> >>>
> >>>       usleep_range(us, us / 8);
> >
> > Should be
> >
> >       usleep_range(us, us + us / 8);
> >
> > (I notoriously confuse this API).
>
> I see.
>
> >>> }
> >>
> >> FWIW I was thinking the same thing, that it would be good to still
> >> use msleep when the sleep is > (MSEC_PER_SEC / HZ), not sure
> >> why you added the 12 there ? Surely something like a sleep longer
> >> then 3 timerticks (I know we have NOHZ but still) would already be
> >> long enough to not worry about msleep slack ?
> >
> > The typical msleep() overhead in 6.12 appears to be 1.5 jiffy which is
> > 1.5 * MSEC_PER_SEC / HZ and I want the usleep() delta to be less than
> > this, so
> >
> > delta = ms / 8 <= 1.5 * MSEC_PER_SEC / HZ
>
> Ok, that makes sense. But this probably requires a comment explaining
> this so that when someone looks at this in the future they understand
> where the 12 comes from.

Sure.

> Where as the / 8 is just a choice right? I think it is decent choice,
> but still this is just a value you picked which should work nicely,
> right ?

Right.

I chose a power of 2 close to 10%.

> >> OTOH it is not like we will hit these ACPI acpi_os_sleep()
> >> calls multiple times per second all the time. On a normal idle
> >> system I expect there to not be that many calls (could still
> >> be a few from ACPI managed devices going into + out of
> >> runtime-pm regularly). And if don't hit acpi_os_sleep() calls
> >> multiple times per second then the chances of time coalescing
> >> are not that big anyways.
> >>
> >> Still I think that finding something middle ground between always
> >> sleeping the exact min time and the old msleep() call, as Rafael
> >> is proposing, would be good IMHO.
> >
> > Thanks for the feedback!
>
> You're welcome.
>
> Len any chance you can give Rafael's proposal a test run on the
> same Dell XPS 13 9300 and see what this means for the resume time ?
>
> If this gets close enough to your patch I think we should go with
> what Rafael is proposing.

Thanks!


> >>>> usleep_range(min, min) is used because there is scant
> >>>> opportunity for timer coalescing during ACPI flows
> >>>> related to system suspend, resume (or initialization).
> >>>>
> >>>> ie. During these flows usleep_range(min, max) is observed to
> >>>> be effectvely be the same as usleep_range(max, max).
> >>>>
> >>>> Similarly, msleep() for long sleeps is not considered because
> >>>> these flows almost never have opportunities to coalesce
> >>>> with other activity on jiffie boundaries, leaving no
> >>>> measurably benefit to rounding up to jiffie boundaries.
> >>>>
> >>>> Background:
> >>>>
> >>>> acpi_os_sleep() supports the ACPI AML Sleep(msec) operator,
> >>>> and it must not return before the requested number of msec.
> >>>>
> >>>> Until Linux-3.13, this contract was sometimes violated by using
> >>>> schedule_timeout_interruptible(j), which could return early.
> >>>>
> >>>> Since Linux-3.13, acpi_os_sleep() uses msleep(),
> >>>> which doesn't return early, but is still subject
> >>>> to long delays due to the low resolution of the jiffie clock.
> >>>>
> >>>> Linux-6.12 removed a stray jiffie from msleep: commit 4381b895f544
> >>>> ("timers: Remove historical extra jiffie for timeout in msleep()")
> >>>> The 4ms savings is material for some durations,
> >>>> but msleep is still generally too course. eg msleep(5)
> >>>> on a 250HZ system still takes 11.9ms.
> >>>>
> >>>> System resume performance of a Dell XPS 13 9300:
> >>>>
> >>>> Linux-6.11:
> >>>> msleep HZ 250   2460 ms
> >>>>
> >>>> Linux-6.12:
> >>>> msleep HZ 250   1943 ms
> >>>> msleep HZ 1000  1233 ms
> >>>> usleep HZ 250   1127 ms
> >>>> usleep HZ 1000  1130 ms
> >>>>
> >>>> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=216263
> >>>> Signed-off-by: Len Brown <len.brown@intel.com>
> >>>> Suggested-by: Arjan van de Ven <arjan@linux.intel.com>
> >>>> Tested-by: Todd Brandt <todd.e.brandt@intel.com>
> >>>> ---
> >>>>  drivers/acpi/osl.c | 4 +++-
> >>>>  1 file changed, 3 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
> >>>> index 70af3fbbebe5..daf87e33b8ea 100644
> >>>> --- a/drivers/acpi/osl.c
> >>>> +++ b/drivers/acpi/osl.c
> >>>> @@ -607,7 +607,9 @@ acpi_status acpi_os_remove_interrupt_handler(u32 gsi, acpi_osd_handler handler)
> >>>>
> >>>>  void acpi_os_sleep(u64 ms)
> >>>>  {
> >>>> -       msleep(ms);
> >>>> +       u64 us = ms * USEC_PER_MSEC;
> >>>> +
> >>>> +       usleep_range(us, us);
> >>>>  }
> >>>>
> >>>>  void acpi_os_stall(u32 us)
> >>>> --
> >>>> 2.43.0
> >>>>
> >>>
> >>
> >
>
Arjan van de Ven Nov. 18, 2024, 2:35 p.m. UTC | #6
> And the argument seems to be that it is better to always use more
> resources in a given path (ACPI sleep in this particular case) than to
> be somewhat inaccurate which is visible in some cases.
> 
> This would mean that hrtimers should always be used everywhere, but they aren't.

more or less rule of thumb is that regular timers are optimized for not firing case
(e.g. timeouts that get deleted when the actual event happens) while hrtimers
are optimized for the case where the timer is expected to fire.

(I'm with you on the slack argument, some amount of slack, even if it is only a usec or two,
is very helpful)
Rafael J. Wysocki Nov. 19, 2024, 1:42 p.m. UTC | #7
On Mon, Nov 18, 2024 at 3:35 PM Arjan van de Ven <arjan@linux.intel.com> wrote:
>
> > And the argument seems to be that it is better to always use more
> > resources in a given path (ACPI sleep in this particular case) than to
> > be somewhat inaccurate which is visible in some cases.
> >
> > This would mean that hrtimers should always be used everywhere, but they aren't.
>
> more or less rule of thumb is that regular timers are optimized for not firing case
> (e.g. timeouts that get deleted when the actual event happens) while hrtimers
> are optimized for the case where the timer is expected to fire.

I've heard that, which makes me wonder why msleep() is still there.

One thing that's rarely mentioned is that programming a timer in HW
actually takes time, so if it is done too often, it hurts performance
through latency (even if this is the TSC deadline timer).

> (I'm with you on the slack argument, some amount of slack, even if it is only a usec or two,
> is very helpful)

Thanks!
Arjan van de Ven Nov. 19, 2024, 3:08 p.m. UTC | #8
On 11/19/2024 5:42 AM, Rafael J. Wysocki wrote:
> On Mon, Nov 18, 2024 at 3:35 PM Arjan van de Ven <arjan@linux.intel.com> wrote:
>>
>>> And the argument seems to be that it is better to always use more
>>> resources in a given path (ACPI sleep in this particular case) than to
>>> be somewhat inaccurate which is visible in some cases.
>>>
>>> This would mean that hrtimers should always be used everywhere, but they aren't.
>>
>> more or less rule of thumb is that regular timers are optimized for not firing case
>> (e.g. timeouts that get deleted when the actual event happens) while hrtimers
>> are optimized for the case where the timer is expected to fire.
> 
> I've heard that, which makes me wonder why msleep() is still there.
> 
> One thing that's rarely mentioned is that programming a timer in HW
> actually takes time, so if it is done too often, it hurts performance
> through latency (even if this is the TSC deadline timer).

yup and this is why you want to group events together "somewhat", and which is why
we have slack, to allow that to happen

> 
>> (I'm with you on the slack argument, some amount of slack, even if it is only a usec or two,
>> is very helpful)
> 
> Thanks!
Pierre Gondois Nov. 20, 2024, 9:01 a.m. UTC | #9
On 11/18/24 13:02, Rafael J. Wysocki wrote:
> Hi Hans,
> 
> On Mon, Nov 18, 2024 at 12:38 PM Hans de Goede <hdegoede@redhat.com> wrote:
>>
>> Hi Rafael, Len,
>>
>> On 18-Nov-24 12:03 PM, Rafael J. Wysocki wrote:
>>> On Sat, Nov 16, 2024 at 12:11 AM Len Brown <lenb@kernel.org> wrote:
>>>>
>>>> From: Len Brown <len.brown@intel.com>
>>>>
>>>> Replace msleep() with usleep_range() in acpi_os_sleep().
>>>>
>>>> This has a significant user-visible performance benefit
>>>> on some ACPI flows on some systems.  eg. Kernel resume
>>>> time of a Dell XPS-13-9300 drops from 1943ms to 1127ms (42%).
>>>
>>> Sure.
>>>
>>> And the argument seems to be that it is better to always use more
>>> resources in a given path (ACPI sleep in this particular case) than to
>>> be somewhat inaccurate which is visible in some cases.
>>>
>>> This would mean that hrtimers should always be used everywhere, but they aren't.
>>>
>>> While I have nothing against addressing the short sleeps issue where
>>> the msleep() inaccuracy is too large, I don't see why this requires
>>> using a hrtimer with no slack in all cases.
>>>
>>> The argument seems to be that the short sleeps case is hard to
>>> distinguish from the other cases, but I'm not sure about this.
>>>
>>> Also, something like this might work, but for some reason you don't
>>> want to do it:
>>>
>>> if (ms >= 12 * MSEC_PER_SEC / HZ) {
>>>          msleep(ms);
>>> } else {
>>>         u64 us = ms * USEC_PER_MSEC;
>>>
>>>        usleep_range(us, us / 8);
> 
> Should be
> 
>        usleep_range(us, us + us / 8);
> 
> (I notoriously confuse this API).
> 
>>> }
>>
>> FWIW I was thinking the same thing, that it would be good to still
>> use msleep when the sleep is > (MSEC_PER_SEC / HZ), not sure
>> why you added the 12 there ? Surely something like a sleep longer
>> then 3 timerticks (I know we have NOHZ but still) would already be
>> long enough to not worry about msleep slack ?
> 
> The typical msleep() overhead in 6.12 appears to be 1.5 jiffy which is
> 1.5 * MSEC_PER_SEC / HZ and I want the usleep() delta to be less than
> this, so
> 
> delta = ms / 8 <= 1.5 * MSEC_PER_SEC / HZ
> 
>> And I assume the usleep_range(us, us / 8); is a typo ? Ma can
>> never be less then max, maybe you meant: usleep_range(us, us + 8) ?
> 
> No, please see above.
> 
>> OTOH it is not like we will hit these ACPI acpi_os_sleep()
>> calls multiple times per second all the time. On a normal idle
>> system I expect there to not be that many calls (could still
>> be a few from ACPI managed devices going into + out of
>> runtime-pm regularly). And if don't hit acpi_os_sleep() calls
>> multiple times per second then the chances of time coalescing
>> are not that big anyways.
>>
>> Still I think that finding something middle ground between always
>> sleeping the exact min time and the old msleep() call, as Rafael
>> is proposing, would be good IMHO.
> 
> Thanks for the feedback!
> 
> 
>>>> usleep_range(min, min) is used because there is scant
>>>> opportunity for timer coalescing during ACPI flows
>>>> related to system suspend, resume (or initialization).
>>>>
>>>> ie. During these flows usleep_range(min, max) is observed to
>>>> be effectvely be the same as usleep_range(max, max).
>>>>
>>>> Similarly, msleep() for long sleeps is not considered because
>>>> these flows almost never have opportunities to coalesce
>>>> with other activity on jiffie boundaries, leaving no
>>>> measurably benefit to rounding up to jiffie boundaries.
>>>>
>>>> Background:
>>>>
>>>> acpi_os_sleep() supports the ACPI AML Sleep(msec) operator,
>>>> and it must not return before the requested number of msec.
>>>>
>>>> Until Linux-3.13, this contract was sometimes violated by using
>>>> schedule_timeout_interruptible(j), which could return early.
>>>>
>>>> Since Linux-3.13, acpi_os_sleep() uses msleep(),
>>>> which doesn't return early, but is still subject
>>>> to long delays due to the low resolution of the jiffie clock.
>>>>
>>>> Linux-6.12 removed a stray jiffie from msleep: commit 4381b895f544
>>>> ("timers: Remove historical extra jiffie for timeout in msleep()")
>>>> The 4ms savings is material for some durations,
>>>> but msleep is still generally too course. eg msleep(5)
>>>> on a 250HZ system still takes 11.9ms.
>>>>
>>>> System resume performance of a Dell XPS 13 9300:
>>>>
>>>> Linux-6.11:
>>>> msleep HZ 250   2460 ms
>>>>
>>>> Linux-6.12:
>>>> msleep HZ 250   1943 ms
>>>> msleep HZ 1000  1233 ms
>>>> usleep HZ 250   1127 ms
>>>> usleep HZ 1000  1130 ms
>>>>
>>>> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=216263
>>>> Signed-off-by: Len Brown <len.brown@intel.com>
>>>> Suggested-by: Arjan van de Ven <arjan@linux.intel.com>
>>>> Tested-by: Todd Brandt <todd.e.brandt@intel.com>
>>>> ---
>>>>   drivers/acpi/osl.c | 4 +++-
>>>>   1 file changed, 3 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
>>>> index 70af3fbbebe5..daf87e33b8ea 100644
>>>> --- a/drivers/acpi/osl.c
>>>> +++ b/drivers/acpi/osl.c
>>>> @@ -607,7 +607,9 @@ acpi_status acpi_os_remove_interrupt_handler(u32 gsi, acpi_osd_handler handler)
>>>>
>>>>   void acpi_os_sleep(u64 ms)
>>>>   {
>>>> -       msleep(ms);
>>>> +       u64 us = ms * USEC_PER_MSEC;
>>>> +
>>>> +       usleep_range(us, us);
>>>>   }
>>>>
>>>>   void acpi_os_stall(u32 us)
>>>> --
>>>> 2.43.0
>>>>
>>>
>>
> 


FWIW, testing the above version on an Arm Juno platform by executing
the following method:

Method (SLEE, 1, Serialized)  {
   Sleep(Arg0)
}

_wo: without patch
_w: with patch
- Values in ns.
- Requesting to sleep X ms
- Tested over 10 iterations
- HZ=250
+------+------------+----------+------------+---------+-----------+
|   ms |    mean_wo |   std_wo |    mean_w  |  std_w  |     ratio |
+------+------------+----------+------------+---------+-----------+
|    1 |    8087797 |  2079703 |    1313920 |   55066 | -83.75429 |
|    2 |    7942471 |  2201985 |    2416064 |  111604 | -69.58044 |
|    3 |    8373704 |   144274 |    3632537 |  111037 | -56.61970 |
|    4 |    7946013 |  2214330 |    4606028 |  255838 | -42.03346 |
|    5 |   11418920 |  1673914 |    5955548 |  131862 | -47.84490 |
|    6 |   11427042 |  1677519 |    7045713 |  211439 | -38.34176 |
|    7 |   12301242 |   221580 |    8174633 |  330050 | -33.54628 |
|    8 |   11411606 |  1672182 |    9191048 |  431767 | -19.45877 |
|    9 |   16722304 |  1288625 |   10517284 |  103274 | -37.10625 |
|   10 |   16746542 |  1280385 |   11564426 |  417218 | -30.94439 |
|   20 |   24294957 |    70703 |   22756497 |  673936 |  -6.33243 |
|   30 |   36284782 |    74340 |   34131455 |  391473 |  -5.93452 |
|   40 |   44703162 |  1199709 |   45407108 |  289715 |   1.57471 |
|   50 |   56311282 |   281418 |   56098040 |  607739 |  -0.37868 |
|   60 |   64225811 |   247587 |   64302246 |  132059 |   0.11901 |
|   70 |   76299457 |    99853 |   76282497 |   83910 |  -0.02223 |
|  100 |  104214393 |    38642 |  104212524 |  244424 |  -0.00179 |
| 1000 | 1016131215 |   245725 | 1017051744 | 2748280 |   0.09059 |
| 2000 | 2007711297 |  1325094 | 2007628922 | 1421807 |  -0.00410 |
+------+------------+----------+------------+---------+-----------+
- With the patch, the min sleep duration is never below the requested
   sleep duration

So indeed the penalty of using msleep is big for small sleep durations.
Dietmar Eggemann Nov. 20, 2024, 12:06 p.m. UTC | #10
On 20/11/2024 10:01, Pierre Gondois wrote:
> 
> 
> On 11/18/24 13:02, Rafael J. Wysocki wrote:
>> Hi Hans,
>>
>> On Mon, Nov 18, 2024 at 12:38 PM Hans de Goede <hdegoede@redhat.com>
>> wrote:
>>>
>>> Hi Rafael, Len,
>>>
>>> On 18-Nov-24 12:03 PM, Rafael J. Wysocki wrote:
>>>> On Sat, Nov 16, 2024 at 12:11 AM Len Brown <lenb@kernel.org> wrote:

[...]

> FWIW, testing the above version on an Arm Juno platform by executing
> the following method:
> 
> Method (SLEE, 1, Serialized)  {
>   Sleep(Arg0)
> }
> 
> _wo: without patch
> _w: with patch
> - Values in ns.
> - Requesting to sleep X ms
> - Tested over 10 iterations
> - HZ=250
> +------+------------+----------+------------+---------+-----------+
> |   ms |    mean_wo |   std_wo |    mean_w  |  std_w  |     ratio |
> +------+------------+----------+------------+---------+-----------+
> |    1 |    8087797 |  2079703 |    1313920 |   55066 | -83.75429 |
> |    2 |    7942471 |  2201985 |    2416064 |  111604 | -69.58044 |
> |    3 |    8373704 |   144274 |    3632537 |  111037 | -56.61970 |
> |    4 |    7946013 |  2214330 |    4606028 |  255838 | -42.03346 |
> |    5 |   11418920 |  1673914 |    5955548 |  131862 | -47.84490 |
> |    6 |   11427042 |  1677519 |    7045713 |  211439 | -38.34176 |
> |    7 |   12301242 |   221580 |    8174633 |  330050 | -33.54628 |
> |    8 |   11411606 |  1672182 |    9191048 |  431767 | -19.45877 |
> |    9 |   16722304 |  1288625 |   10517284 |  103274 | -37.10625 |
> |   10 |   16746542 |  1280385 |   11564426 |  417218 | -30.94439 |
> |   20 |   24294957 |    70703 |   22756497 |  673936 |  -6.33243 |
> |   30 |   36284782 |    74340 |   34131455 |  391473 |  -5.93452 |
> |   40 |   44703162 |  1199709 |   45407108 |  289715 |   1.57471 |
> |   50 |   56311282 |   281418 |   56098040 |  607739 |  -0.37868 |
> |   60 |   64225811 |   247587 |   64302246 |  132059 |   0.11901 |
> |   70 |   76299457 |    99853 |   76282497 |   83910 |  -0.02223 |
> |  100 |  104214393 |    38642 |  104212524 |  244424 |  -0.00179 |
> | 1000 | 1016131215 |   245725 | 1017051744 | 2748280 |   0.09059 |
> | 2000 | 2007711297 |  1325094 | 2007628922 | 1421807 |  -0.00410 |
> +------+------------+----------+------------+---------+-----------+
> - With the patch, the min sleep duration is never below the requested
>   sleep duration
> 
> So indeed the penalty of using msleep is big for small sleep durations.

Just to make sure, this is with Rafael's proposal, using msleep() for
values >= 48ms = (12 * 1000/250)ms and usleep_range() otherwise?
Pierre Gondois Nov. 20, 2024, 12:59 p.m. UTC | #11
On 11/20/24 13:06, Dietmar Eggemann wrote:
> On 20/11/2024 10:01, Pierre Gondois wrote:
>>
>>
>> On 11/18/24 13:02, Rafael J. Wysocki wrote:
>>> Hi Hans,
>>>
>>> On Mon, Nov 18, 2024 at 12:38 PM Hans de Goede <hdegoede@redhat.com>
>>> wrote:
>>>>
>>>> Hi Rafael, Len,
>>>>
>>>> On 18-Nov-24 12:03 PM, Rafael J. Wysocki wrote:
>>>>> On Sat, Nov 16, 2024 at 12:11 AM Len Brown <lenb@kernel.org> wrote:
> 
> [...]
> 
>> FWIW, testing the above version on an Arm Juno platform by executing
>> the following method:
>>
>> Method (SLEE, 1, Serialized)  {
>>    Sleep(Arg0)
>> }
>>
>> _wo: without patch
>> _w: with patch
>> - Values in ns.
>> - Requesting to sleep X ms
>> - Tested over 10 iterations
>> - HZ=250
>> +------+------------+----------+------------+---------+-----------+
>> |   ms |    mean_wo |   std_wo |    mean_w  |  std_w  |     ratio |
>> +------+------------+----------+------------+---------+-----------+
>> |    1 |    8087797 |  2079703 |    1313920 |   55066 | -83.75429 |
>> |    2 |    7942471 |  2201985 |    2416064 |  111604 | -69.58044 |
>> |    3 |    8373704 |   144274 |    3632537 |  111037 | -56.61970 |
>> |    4 |    7946013 |  2214330 |    4606028 |  255838 | -42.03346 |
>> |    5 |   11418920 |  1673914 |    5955548 |  131862 | -47.84490 |
>> |    6 |   11427042 |  1677519 |    7045713 |  211439 | -38.34176 |
>> |    7 |   12301242 |   221580 |    8174633 |  330050 | -33.54628 |
>> |    8 |   11411606 |  1672182 |    9191048 |  431767 | -19.45877 |
>> |    9 |   16722304 |  1288625 |   10517284 |  103274 | -37.10625 |
>> |   10 |   16746542 |  1280385 |   11564426 |  417218 | -30.94439 |
>> |   20 |   24294957 |    70703 |   22756497 |  673936 |  -6.33243 |
>> |   30 |   36284782 |    74340 |   34131455 |  391473 |  -5.93452 |
>> |   40 |   44703162 |  1199709 |   45407108 |  289715 |   1.57471 |
>> |   50 |   56311282 |   281418 |   56098040 |  607739 |  -0.37868 |
>> |   60 |   64225811 |   247587 |   64302246 |  132059 |   0.11901 |
>> |   70 |   76299457 |    99853 |   76282497 |   83910 |  -0.02223 |
>> |  100 |  104214393 |    38642 |  104212524 |  244424 |  -0.00179 |
>> | 1000 | 1016131215 |   245725 | 1017051744 | 2748280 |   0.09059 |
>> | 2000 | 2007711297 |  1325094 | 2007628922 | 1421807 |  -0.00410 |
>> +------+------------+----------+------------+---------+-----------+
>> - With the patch, the min sleep duration is never below the requested
>>    sleep duration
>>
>> So indeed the penalty of using msleep is big for small sleep durations.
> 
> Just to make sure, this is with Rafael's proposal, using msleep() for
> values >= 48ms = (12 * 1000/250)ms and usleep_range() otherwise?
> 

Yes exact
Rafael J. Wysocki Nov. 20, 2024, 6:03 p.m. UTC | #12
On Tue, Nov 19, 2024 at 4:08 PM Arjan van de Ven <arjan@linux.intel.com> wrote:
>
> On 11/19/2024 5:42 AM, Rafael J. Wysocki wrote:
> > On Mon, Nov 18, 2024 at 3:35 PM Arjan van de Ven <arjan@linux.intel.com> wrote:
> >>
> >>> And the argument seems to be that it is better to always use more
> >>> resources in a given path (ACPI sleep in this particular case) than to
> >>> be somewhat inaccurate which is visible in some cases.
> >>>
> >>> This would mean that hrtimers should always be used everywhere, but they aren't.
> >>
> >> more or less rule of thumb is that regular timers are optimized for not firing case
> >> (e.g. timeouts that get deleted when the actual event happens) while hrtimers
> >> are optimized for the case where the timer is expected to fire.
> >
> > I've heard that, which makes me wonder why msleep() is still there.
> >
> > One thing that's rarely mentioned is that programming a timer in HW
> > actually takes time, so if it is done too often, it hurts performance
> > through latency (even if this is the TSC deadline timer).
>
> yup and this is why you want to group events together "somewhat", and which is why
> we have slack, to allow that to happen

So what do you think would be the minimum slack to use in this case?

I thought about something on the order of 199 us, but now I'm thinking
that 50 us would work too.  Less than this - I'm not sure.
Len Brown Nov. 20, 2024, 6:35 p.m. UTC | #13
On Mon, Nov 18, 2024 at 6:03 AM Rafael J. Wysocki <rafael@kernel.org> wrote:

> if (ms >= 12 * MSEC_PER_SEC / HZ) {
>         msleep(ms);
> } else {
>        u64 us = ms * USEC_PER_MSEC;
>
>       usleep_range(us, us / 8);
> }

The purpose of delaying a timer via timer slack is to afford the
opportunity to coalesce timers, when the precision wasn't actually
necessary.

I agree that  precision is not necessary here -- we are talking about
acpi_os_sleep(), which has a granularity of 1ms.

So the question becomes how much coalescing do we get for how much delay?

In the suspend/resume flows that sparked this discussion, the answer
is zero --- and so for that case, zero timer slack is justified.

I acknowledge that you refuse to apply my patch.

However, if you are going to add slack, I would like it to be clear
why users tasks, which may invoke HR timers at a dizzying rate, are
subject to a flat 50us timerslack_ns, while the entire ACPI
sub-system, which invokes a single timer, would see a 650usec delay
added for 5,000 usec sleep.
Arjan van de Ven Nov. 20, 2024, 6:37 p.m. UTC | #14
On 11/20/2024 10:03 AM, Rafael J. Wysocki wrote:
> On Tue, Nov 19, 2024 at 4:08 PM Arjan van de Ven <arjan@linux.intel.com> wrote:
>>
>> On 11/19/2024 5:42 AM, Rafael J. Wysocki wrote:
>>> On Mon, Nov 18, 2024 at 3:35 PM Arjan van de Ven <arjan@linux.intel.com> wrote:
>>>>
>>>>> And the argument seems to be that it is better to always use more
>>>>> resources in a given path (ACPI sleep in this particular case) than to
>>>>> be somewhat inaccurate which is visible in some cases.
>>>>>
>>>>> This would mean that hrtimers should always be used everywhere, but they aren't.
>>>>
>>>> more or less rule of thumb is that regular timers are optimized for not firing case
>>>> (e.g. timeouts that get deleted when the actual event happens) while hrtimers
>>>> are optimized for the case where the timer is expected to fire.
>>>
>>> I've heard that, which makes me wonder why msleep() is still there.
>>>
>>> One thing that's rarely mentioned is that programming a timer in HW
>>> actually takes time, so if it is done too often, it hurts performance
>>> through latency (even if this is the TSC deadline timer).
>>
>> yup and this is why you want to group events together "somewhat", and which is why
>> we have slack, to allow that to happen
> 
> So what do you think would be the minimum slack to use in this case?
> 
> I thought about something on the order of 199 us, but now I'm thinking
> that 50 us would work too.  Less than this - I'm not sure.

50 usec is likely more than enough in practice.
Rafael J. Wysocki Nov. 20, 2024, 6:49 p.m. UTC | #15
On Wed, Nov 20, 2024 at 7:38 PM Arjan van de Ven <arjan@linux.intel.com> wrote:
>
> On 11/20/2024 10:03 AM, Rafael J. Wysocki wrote:
> > On Tue, Nov 19, 2024 at 4:08 PM Arjan van de Ven <arjan@linux.intel.com> wrote:
> >>
> >> On 11/19/2024 5:42 AM, Rafael J. Wysocki wrote:
> >>> On Mon, Nov 18, 2024 at 3:35 PM Arjan van de Ven <arjan@linux.intel.com> wrote:
> >>>>
> >>>>> And the argument seems to be that it is better to always use more
> >>>>> resources in a given path (ACPI sleep in this particular case) than to
> >>>>> be somewhat inaccurate which is visible in some cases.
> >>>>>
> >>>>> This would mean that hrtimers should always be used everywhere, but they aren't.
> >>>>
> >>>> more or less rule of thumb is that regular timers are optimized for not firing case
> >>>> (e.g. timeouts that get deleted when the actual event happens) while hrtimers
> >>>> are optimized for the case where the timer is expected to fire.
> >>>
> >>> I've heard that, which makes me wonder why msleep() is still there.
> >>>
> >>> One thing that's rarely mentioned is that programming a timer in HW
> >>> actually takes time, so if it is done too often, it hurts performance
> >>> through latency (even if this is the TSC deadline timer).
> >>
> >> yup and this is why you want to group events together "somewhat", and which is why
> >> we have slack, to allow that to happen
> >
> > So what do you think would be the minimum slack to use in this case?
> >
> > I thought about something on the order of 199 us, but now I'm thinking
> > that 50 us would work too.  Less than this - I'm not sure.
>
> 50 usec is likely more than enough in practice.

And would you use the same slack value regardless of the sleep
duration, or make it somehow depend on the sleep duration?
Len Brown Nov. 20, 2024, 6:54 p.m. UTC | #16
On Wed, Nov 20, 2024 at 1:50 PM Rafael J. Wysocki <rafael@kernel.org> wrote:

> > 50 usec is likely more than enough in practice.
>
> And would you use the same slack value regardless of the sleep
> duration, or make it somehow depend on the sleep duration?

timerslack_ns is 50 usec for all user-space, no matter the duration.

This part was done right -- it doesn't depend on the sleep duration.

Coalescing depends on the pattern of wakeups over time.
That pattern doesn't necessarily depend on sleep duration,
so it is a hard sell to tie them together.
Arjan van de Ven Nov. 20, 2024, 7:18 p.m. UTC | #17
On 11/20/2024 10:49 AM, Rafael J. Wysocki wrote:
>>> I thought about something on the order of 199 us, but now I'm thinking
>>> that 50 us would work too.  Less than this - I'm not sure.
>>
>> 50 usec is likely more than enough in practice.
> 
> And would you use the same slack value regardless of the sleep
> duration, or make it somehow depend on the sleep duration?

I don't see why you'd make it dependent on the sleep duration
sure in theory the longer the sleep -- you could pick a fixed percentage

but you're trying to amortize a theoretical timer register write, and a
cpu wakeup. the timer write is fixed cost and not THAT expensive after
some amount of this . the C state wake up --- sure that is more variable
but that is super important for high occurance things (thousands to millions
of times per hour).
If your ACPI sleeps are high occurance on a system I suspect you have way
bigger problems than an occasional extra wakeup
Rafael J. Wysocki Nov. 20, 2024, 7:27 p.m. UTC | #18
On Wed, Nov 20, 2024 at 7:55 PM Len Brown <lenb@kernel.org> wrote:
>
> On Wed, Nov 20, 2024 at 1:50 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> > > 50 usec is likely more than enough in practice.
> >
> > And would you use the same slack value regardless of the sleep
> > duration, or make it somehow depend on the sleep duration?
>
> timerslack_ns is 50 usec for all user-space, no matter the duration.
>
> This part was done right -- it doesn't depend on the sleep duration.
>
> Coalescing depends on the pattern of wakeups over time.

So say there are two tasks that each sleep for 5 ms every 10 ms and
that the first one starts its first sleep 70 us before the other one.

That is, the first one always calles usleep_range(5000, 5000 + delta)
70 us before the other one.

If delta is 50 us, their timers will never coalesce, but if delta is
100 us, they'll always do.

Generally speaking, longer delta values make coalescing more likely.

> That pattern doesn't necessarily depend on sleep duration,
> so it is a hard sell to tie them together.

But it can be argued that delta can be somewhat larger (increasing the
likelihood of timer coalescing) for longer sleeps because they
generally don't need high precision.
Rafael J. Wysocki Nov. 20, 2024, 7:41 p.m. UTC | #19
On Wed, Nov 20, 2024 at 8:18 PM Arjan van de Ven <arjan@linux.intel.com> wrote:
>
> On 11/20/2024 10:49 AM, Rafael J. Wysocki wrote:
> >>> I thought about something on the order of 199 us, but now I'm thinking
> >>> that 50 us would work too.  Less than this - I'm not sure.
> >>
> >> 50 usec is likely more than enough in practice.
> >
> > And would you use the same slack value regardless of the sleep
> > duration, or make it somehow depend on the sleep duration?
>
> I don't see why you'd make it dependent on the sleep duration
> sure in theory the longer the sleep -- you could pick a fixed percentage
>
> but you're trying to amortize a theoretical timer register write, and a
> cpu wakeup. the timer write is fixed cost and not THAT expensive after
> some amount of this . the C state wake up --- sure that is more variable
> but that is super important for high occurance things (thousands to millions
> of times per hour).
> If your ACPI sleeps are high occurance on a system I suspect you have way
> bigger problems than an occasional extra wakeup

I generally think that if you are sleeping relatively long, you may as
well sacrifice some precision for avoiding system stress so to speak,
so I've been considering something like flat 50 us for sleeps between
1 and 5 ms and then 1% of the sleep duration for longer sleeps.
Len Brown Nov. 21, 2024, 10:33 a.m. UTC | #20
On Wed, Nov 20, 2024 at 2:42 PM Rafael J. Wysocki <rafael@kernel.org> wrote:

> I generally think that if you are sleeping relatively long, you may as
> well sacrifice some precision for avoiding system stress so to speak,
> so I've been considering something like flat 50 us for sleeps between
> 1 and 5 ms and then 1% of the sleep duration for longer sleeps.

What is the maximum rate of acpi_os_sleep() invocations?

Assuming the reasoning for user-space timer-slack is sound @ fixed 50 usec,
what logic supports acpi_os_sleep paying more timer slack delay than user space?

What measurements can demonstrate the benefit of this proposed additional delay?
diff mbox series

Patch

diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 70af3fbbebe5..daf87e33b8ea 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -607,7 +607,9 @@  acpi_status acpi_os_remove_interrupt_handler(u32 gsi, acpi_osd_handler handler)
 
 void acpi_os_sleep(u64 ms)
 {
-	msleep(ms);
+	u64 us = ms * USEC_PER_MSEC;
+
+	usleep_range(us, us);
 }
 
 void acpi_os_stall(u32 us)