mbox series

[v9,00/15] arm64: support poll_idle()

Message ID 20241107190818.522639-1-ankur.a.arora@oracle.com
Headers show
Series arm64: support poll_idle() | expand

Message

Ankur Arora Nov. 7, 2024, 7:08 p.m. UTC
This patchset adds support for polling in idle via poll_idle() on
arm64.

There are two main changes in this version:

1. rework the series to take Catalin Marinas' comments on the semantics
   of smp_cond_load_relaxed() (and how earlier versions of this
   series were abusing them) into account.

   This also allows dropping of the somewhat strained connections
   between haltpoll and the event-stream.

2. earlier versions of this series were adding support for poll_idle()
   but only using it in the haltpoll driver. Add Lifeng's patch to
   broaden it out by also polling in acpi-idle.

The benefit of polling in idle is to reduce the cost of remote wakeups.
When enabled, these can be done just by setting the need-resched bit,
instead of sending an IPI, and incurring the cost of handling the
interrupt on the receiver side. When running on a VM it also saves
the cost of WFE trapping (when enabled.)

Comparing sched-pipe performance on a guest VM:

# perf stat -r 5 --cpu 4,5 -e task-clock,cycles,instructions,sched:sched_wake_idle_without_ipi \
  perf bench sched pipe -l 1000000 -c 4

# no polling in idle

 Performance counter stats for 'CPU(s) 4,5' (5 runs):

         25,229.57 msec task-clock                       #    2.000 CPUs utilized               ( +-  7.75% )
    45,821,250,284      cycles                           #    1.816 GHz                         ( +- 10.07% )
    26,557,496,665      instructions                     #    0.58  insn per cycle              ( +-  0.21% )
                 0      sched:sched_wake_idle_without_ipi #    0.000 /sec

            12.615 +- 0.977 seconds time elapsed  ( +-  7.75% )


# polling in idle (with haltpoll):

 Performance counter stats for 'CPU(s) 4,5' (5 runs):

         15,131.58 msec task-clock                       #    2.000 CPUs utilized               ( +- 10.00% )
    34,158,188,839      cycles                           #    2.257 GHz                         ( +-  6.91% )
    20,824,950,916      instructions                     #    0.61  insn per cycle              ( +-  0.09% )
         1,983,822      sched:sched_wake_idle_without_ipi #  131.105 K/sec                       ( +-  0.78% )

             7.566 +- 0.756 seconds time elapsed  ( +- 10.00% )

Tomohiro Misono and Haris Okanovic also report similar latency
improvements on Grace and Graviton systems (for v7) [1] [2].
Lifeng also reports improved context switch latency on a bare-metal
machine with acpi-idle [3].

The series is in four parts:

 - patches 1-4,

    "asm-generic: add barrier smp_cond_load_relaxed_timeout()"
    "cpuidle/poll_state: poll via smp_cond_load_relaxed_timeout()"
    "cpuidle: rename ARCH_HAS_CPU_RELAX to ARCH_HAS_OPTIMIZED_POLL"
    "Kconfig: move ARCH_HAS_OPTIMIZED_POLL to arch/Kconfig"

   add smp_cond_load_relaxed_timeout() and switch poll_idle() to
   using it. Also, do some munging of related kconfig options.

 - patches 5-7,

    "arm64: barrier: add support for smp_cond_relaxed_timeout()"
    "arm64: define TIF_POLLING_NRFLAG"
    "arm64: add support for polling in idle"

   add support for the new barrier, the polling flag and enable
   poll_idle() support.

 - patches 8, 9-13,

    "ACPI: processor_idle: Support polling state for LPI"

    "cpuidle-haltpoll: define arch_haltpoll_want()"
    "governors/haltpoll: drop kvm_para_available() check"
    "cpuidle-haltpoll: condition on ARCH_CPUIDLE_HALTPOLL"
    "arm64: idle: export arch_cpu_idle"
    "arm64: support cpuidle-haltpoll"

    add support for polling via acpi-idle, and cpuidle-haltpoll.

  - patches 14, 15,
     "arm64/delay: move some constants out to a separate header"
     "arm64: support WFET in smp_cond_relaxed_timeout()"

    are RFC patches to enable WFET support.

Changelog:

v9:

 - reworked the series to address a comment from Catalin Marinas
   about how v8 was abusing semantics of smp_cond_load_relaxed().
 - add poll_idle() support in acpi-idle (Lifeng Zheng)
 - dropped some earlier "Tested-by", "Reviewed-by" due to the
   above rework.

v8: No logic changes. Largely respin of v7, with changes
noted below:

 - move selection of ARCH_HAS_OPTIMIZED_POLL on arm64 to its
   own patch.
   (patch-9 "arm64: select ARCH_HAS_OPTIMIZED_POLL")
   
 - address comments simplifying arm64 support (Will Deacon)
   (patch-11 "arm64: support cpuidle-haltpoll")

v7: No significant logic changes. Mostly a respin of v6.

 - minor cleanup in poll_idle() (Christoph Lameter)
 - fixes conflicts due to code movement in arch/arm64/kernel/cpuidle.c
   (Tomohiro Misono)

v6:

 - reordered the patches to keep poll_idle() and ARCH_HAS_OPTIMIZED_POLL
   changes together (comment from Christoph Lameter)
 - threshes out the commit messages a bit more (comments from Christoph
   Lameter, Sudeep Holla)
 - also rework selection of cpuidle-haltpoll. Now selected based
   on the architectural selection of ARCH_CPUIDLE_HALTPOLL.
 - moved back to arch_haltpoll_want() (comment from Joao Martins)
   Also, arch_haltpoll_want() now takes the force parameter and is
   now responsible for the complete selection (or not) of haltpoll.
 - fixes the build breakage on i386
 - fixes the cpuidle-haltpoll module breakage on arm64 (comment from
   Tomohiro Misono, Haris Okanovic)

v5:
 - rework the poll_idle() loop around smp_cond_load_relaxed() (review
   comment from Tomohiro Misono.)
 - also rework selection of cpuidle-haltpoll. Now selected based
   on the architectural selection of ARCH_CPUIDLE_HALTPOLL.
 - arch_haltpoll_supported() (renamed from arch_haltpoll_want()) on
   arm64 now depends on the event-stream being enabled.
 - limit POLL_IDLE_RELAX_COUNT on arm64 (review comment from Haris Okanovic)
 - ARCH_HAS_CPU_RELAX is now renamed to ARCH_HAS_OPTIMIZED_POLL.

v4 changes from v3:
 - change 7/8 per Rafael input: drop the parens and use ret for the final check
 - add 8/8 which renames the guard for building poll_state

v3 changes from v2:
 - fix 1/7 per Petr Mladek - remove ARCH_HAS_CPU_RELAX from arch/x86/Kconfig
 - add Ack-by from Rafael Wysocki on 2/7

v2 changes from v1:
 - added patch 7 where we change cpu_relax with smp_cond_load_relaxed per PeterZ
   (this improves by 50% at least the CPU cycles consumed in the tests above:
   10,716,881,137 now vs 14,503,014,257 before)
 - removed the ifdef from patch 1 per RafaelW

Please review.

[1] https://lore.kernel.org/lkml/TY3PR01MB111481E9B0AF263ACC8EA5D4AE5BA2@TY3PR01MB11148.jpnprd01.prod.outlook.com/
[2] https://lore.kernel.org/lkml/104d0ec31cb45477e27273e089402d4205ee4042.camel@amazon.com/
[3] https://lore.kernel.org/lkml/f8a1f85b-c4bf-4c38-81bf-728f72a4f2fe@huawei.com/

Ankur Arora (10):
  asm-generic: add barrier smp_cond_load_relaxed_timeout()
  cpuidle/poll_state: poll via smp_cond_load_relaxed_timeout()
  cpuidle: rename ARCH_HAS_CPU_RELAX to ARCH_HAS_OPTIMIZED_POLL
  arm64: barrier: add support for smp_cond_relaxed_timeout()
  arm64: add support for polling in idle
  cpuidle-haltpoll: condition on ARCH_CPUIDLE_HALTPOLL
  arm64: idle: export arch_cpu_idle
  arm64: support cpuidle-haltpoll
  arm64/delay: move some constants out to a separate header
  arm64: support WFET in smp_cond_relaxed_timeout()

Joao Martins (4):
  Kconfig: move ARCH_HAS_OPTIMIZED_POLL to arch/Kconfig
  arm64: define TIF_POLLING_NRFLAG
  cpuidle-haltpoll: define arch_haltpoll_want()
  governors/haltpoll: drop kvm_para_available() check

Lifeng Zheng (1):
  ACPI: processor_idle: Support polling state for LPI

 arch/Kconfig                              |  3 ++
 arch/arm64/Kconfig                        |  7 +++
 arch/arm64/include/asm/barrier.h          | 62 ++++++++++++++++++++++-
 arch/arm64/include/asm/cmpxchg.h          | 26 ++++++----
 arch/arm64/include/asm/cpuidle_haltpoll.h | 20 ++++++++
 arch/arm64/include/asm/delay-const.h      | 25 +++++++++
 arch/arm64/include/asm/thread_info.h      |  2 +
 arch/arm64/kernel/idle.c                  |  1 +
 arch/arm64/lib/delay.c                    | 13 ++---
 arch/x86/Kconfig                          |  5 +-
 arch/x86/include/asm/cpuidle_haltpoll.h   |  1 +
 arch/x86/kernel/kvm.c                     | 13 +++++
 drivers/acpi/processor_idle.c             | 43 +++++++++++++---
 drivers/cpuidle/Kconfig                   |  5 +-
 drivers/cpuidle/Makefile                  |  2 +-
 drivers/cpuidle/cpuidle-haltpoll.c        | 12 +----
 drivers/cpuidle/governors/haltpoll.c      |  6 +--
 drivers/cpuidle/poll_state.c              | 27 +++-------
 drivers/idle/Kconfig                      |  1 +
 include/asm-generic/barrier.h             | 42 +++++++++++++++
 include/linux/cpuidle.h                   |  2 +-
 include/linux/cpuidle_haltpoll.h          |  5 ++
 22 files changed, 252 insertions(+), 71 deletions(-)
 create mode 100644 arch/arm64/include/asm/cpuidle_haltpoll.h
 create mode 100644 arch/arm64/include/asm/delay-const.h