[v5,0/4] mm: introduce THP deferred setting

Message ID

20250428182904.93989-1-npache@redhat.com

Headers

From: Nico Pache <npache@redhat.com>
To: linux-mm@kvack.org,
	linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org
Cc: akpm@linux-foundation.org,
	corbet@lwn.net,
	rostedt@goodmis.org,
	mhiramat@kernel.org,
	mathieu.desnoyers@efficios.com,
	david@redhat.com,
	baohua@kernel.org,
	baolin.wang@linux.alibaba.com,
	ryan.roberts@arm.com,
	willy@infradead.org,
	peterx@redhat.com,
	shuah@kernel.org,
	ziy@nvidia.com,
	wangkefeng.wang@huawei.com,
	usamaarif642@gmail.com,
	sunnanyong@huawei.com,
	vishal.moola@gmail.com,
	thomas.hellstrom@linux.intel.com,
	yang@os.amperecomputing.com,
	kirill.shutemov@linux.intel.com,
	aarcange@redhat.com,
	raquini@redhat.com,
	dev.jain@arm.com,
	anshuman.khandual@arm.com,
	catalin.marinas@arm.com,
	tiwai@suse.de,
	will@kernel.org,
	dave.hansen@linux.intel.com,
	jack@suse.cz,
	cl@gentwo.org,
	jglisse@google.com,
	surenb@google.com,
	zokeefe@google.com,
	Liam.Howlett@oracle.com,
	lorenzo.stoakes@oracle.com,
	hannes@cmpxchg.org,
	rientjes@google.com,
	mhocko@suse.com,
	rdunlap@infradead.org
Subject: [PATCH v5 0/4] mm: introduce THP deferred setting
Date: Mon, 28 Apr 2025 12:29:00 -0600
Message-ID: <20250428182904.93989-1-npache@redhat.com>
Precedence: bulk
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Message

Nico Pache April 28, 2025, 6:29 p.m. UTC

This series is a follow-up to [1], which adds mTHP support to khugepaged.
mTHP khugepaged support is a "loose" dependency for the sysfs/sysctl
configs to make sense. Without it, the global="defer" and mTHP="inherit"
case is "undefined" behavior.

We've seen cases where customers switching from RHEL7 to RHEL8 see a
significant increase in the memory footprint for the same workloads.

Through our investigations we found that a large contributing factor to
the increase in RSS was an increase in THP usage.

For workloads like MySQL, or when using allocators like jemalloc, it is
often recommended to set /transparent_hugepages/enabled=never. This is
in part due to performance degradations and increased memory waste.

This series introduces enabled=defer. This setting acts as a middle
ground between always and madvise. If the mapping is MADV_HUGEPAGE, the
page fault handler will act normally, making a hugepage if possible. If
the mapping is not MADV_HUGEPAGE, then the page fault handler will
default to the base page size allocation. The caveat is that khugepaged
can still operate on pages that are not MADV_HUGEPAGE.
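
The policy described above can be sketched as a small user-space model.
This is only an illustration, not the kernel implementation: the enum and
both helper names are hypothetical, and MADV_NOHUGEPAGE, per-size mTHP
settings and other VMA restrictions are ignored.

```c
#include <stdbool.h>

/* Hypothetical model of the global "enabled" setting (not kernel names). */
enum thp_mode { THP_NEVER, THP_MADVISE, THP_DEFER, THP_ALWAYS };

/* May the page fault handler install a huge page for this mapping? */
static bool thp_allowed_at_fault(enum thp_mode mode, bool madv_hugepage)
{
    switch (mode) {
    case THP_ALWAYS:
        return true;
    case THP_MADVISE:
    case THP_DEFER:
        /* defer behaves like madvise at fault time */
        return madv_hugepage;
    default:
        return false;
    }
}

/* May khugepaged later collapse this mapping into a huge page? */
static bool thp_allowed_in_khugepaged(enum thp_mode mode, bool madv_hugepage)
{
    switch (mode) {
    case THP_ALWAYS:
    case THP_DEFER:
        /* defer behaves like always for khugepaged */
        return true;
    case THP_MADVISE:
        return madv_hugepage;
    default:
        return false;
    }
}
```

In this model, defer differs from madvise only in the khugepaged column and
from always only in the page fault column.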

This allows for three things. First, applications specifically designed to
use hugepages will get them. Second, applications that don't use
hugepages can still benefit from them without aggressively inserting
THPs at every possible chance. This curbs the memory waste and defers
the use of hugepages to khugepaged, which can then scan the memory for
ranges eligible for collapse. Lastly, there is the added benefit for those
who want THPs but experience higher-latency page faults: they now get base
page performance in the page fault handler and hugepage performance for
those mappings after they collapse.

Admins may want to lower max_ptes_none; otherwise, khugepaged may
aggressively collapse single allocations into hugepages.
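
To illustrate why max_ptes_none matters here, the following is a simplified
sketch of the threshold check, assuming 4 KiB base pages and 2 MiB huge
pages (512 PTEs per PMD). The helper name is hypothetical, and the real
khugepaged scan also looks at swap, shared and referenced pages, which this
model ignores.

```c
#include <stdbool.h>

/* PTEs per PMD-sized THP, assuming 4 KiB base pages and 2 MiB huge pages. */
#define PTES_PER_PMD 512u

/*
 * Simplified model: a PMD-sized range is a collapse candidate only if its
 * number of empty (none) PTEs does not exceed max_ptes_none.
 */
static bool collapse_candidate(unsigned int populated_ptes,
                               unsigned int max_ptes_none)
{
    unsigned int none_ptes;

    if (populated_ptes > PTES_PER_PMD)
        return false; /* invalid input for this model */

    none_ptes = PTES_PER_PMD - populated_ptes;
    return none_ptes <= max_ptes_none;
}
```

With max_ptes_none=511, a range with a single populated page already
qualifies (511 empty PTEs). With max_ptes_none=64, as used in the redis test
below, at least 448 of the 512 pages must be populated.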

TESTING:
- Built for x86_64, aarch64, ppc64le, and s390x
- selftests mm
- In [1] I provided a script [2] that has multiple access patterns
- lots of general use.
- redis testing. This test was my original case for the defer mode. What I
   was able to prove was that THP=always leads to increased max_latency
   cases, which is why it is recommended to disable THPs for redis servers.
   However, with 'defer' we don't have the max_latency spikes and can still
   get the system to utilize THPs. I further tested this with the mTHP
   defer setting and found that redis (and probably other jemalloc users)
   can utilize THPs via defer (+mTHP defer) without a large latency
   penalty and with some potential gains. I uploaded some mmtest results
   here [3], which compare:
       stock+thp=never
       stock+(m)thp=always
       khugepaged-mthp + defer (max_ptes_none=64)

  The results show that (m)THPs can cause some throughput regression in
  some cases, but also have gains in other cases. The mTHP+defer results
  have more gains and fewer losses than the (m)THP=always case.

V5 Changes:
- rebased dependent series
- added reviewed-by tag on 2/4

V4 Changes:
- Minor Documentation fixes
- rebased the dependent series [1] onto mm-unstable
    commit 0e68b850b1d3 ("vmalloc: use atomic_long_add_return_relaxed()")

V3 Changes:
- Combined the documentation commits into one, and moved a section to the
  khugepaged mthp patchset

V2 Changes:
- base changes on mTHP khugepaged support
- Fix selftests parsing issue
- add mTHP defer option
- add mTHP defer Documentation

[1] - https://lore.kernel.org/lkml/20250428181218.85925-1-npache@redhat.com/
[2] - https://gitlab.com/npache/khugepaged_mthp_test
[3] - https://people.redhat.com/npache/mthp_khugepaged_defer/testoutput2/output.html

Nico Pache (4):
  mm: defer THP insertion to khugepaged
  mm: document (m)THP defer usage
  khugepaged: add defer option to mTHP options
  selftests: mm: add defer to thp setting parser

 Documentation/admin-guide/mm/transhuge.rst | 31 +++++++---
 include/linux/huge_mm.h                    | 18 +++++-
 mm/huge_memory.c                           | 69 +++++++++++++++++++---
 mm/khugepaged.c                            |  8 +--
 tools/testing/selftests/mm/thp_settings.c  |  1 +
 tools/testing/selftests/mm/thp_settings.h  |  1 +
 6 files changed, 106 insertions(+), 22 deletions(-)

Comments

Nico Pache April 30, 2025, 6:39 p.m. UTC | #1

On Tue, Apr 29, 2025 at 7:49 AM Zi Yan <ziy@nvidia.com> wrote:
>
> On 28 Apr 2025, at 14:29, Nico Pache wrote:
>
> > setting /transparent_hugepages/enabled=always allows applications
> > to benefit from THPs without having to madvise. However, the pf handler
>
> s/pf/page fault
>
> > takes very few considerations to decide weather or not to actually use a
>
> s/weather/whether
>
> > THP. This can lead to a lot of wasted memory. khugepaged only operates
> > on memory that was either allocated with enabled=always or MADV_HUGEPAGE.
> >
> > Introduce the ability to set enabled=defer, which will prevent THPs from
> > being allocated by the page fault handler unless madvise is set,
> > leaving it up to khugepaged to decide which allocations will collapse to a
> > THP. This should allow applications to benefits from THPs, while curbing
> > some of the memory waste.
> >
> > Co-developed-by: Rafael Aquini <raquini@redhat.com>
> > Signed-off-by: Rafael Aquini <raquini@redhat.com>
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >  include/linux/huge_mm.h | 15 +++++++++++++--
> >  mm/huge_memory.c        | 31 +++++++++++++++++++++++++++----
> >  2 files changed, 40 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index e3d15c737008..57e6c962afb1 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -48,6 +48,7 @@ enum transparent_hugepage_flag {
> >       TRANSPARENT_HUGEPAGE_UNSUPPORTED,
> >       TRANSPARENT_HUGEPAGE_FLAG,
> >       TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> > +     TRANSPARENT_HUGEPAGE_DEFER_PF_INST_FLAG,
>
> What does INST mean here? Can you add one sentence on this new flag
> in the commit log to explain what it is short for?
"INSERT". Someone else commented on the length of this FLAG name. I
forgot to update it.
I can shorten it to something like ..DEFER_FLAG or DEFER_PF_FLAG
>
>
> >       TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
> >       TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
> >       TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG,
> > @@ -186,6 +187,7 @@ static inline bool hugepage_global_enabled(void)
> >  {
> >       return transparent_hugepage_flags &
> >                       ((1<<TRANSPARENT_HUGEPAGE_FLAG) |
> > +                     (1<<TRANSPARENT_HUGEPAGE_DEFER_PF_INST_FLAG) |
> >                       (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG));
> >  }
> >
> > @@ -195,6 +197,12 @@ static inline bool hugepage_global_always(void)
> >                       (1<<TRANSPARENT_HUGEPAGE_FLAG);
> >  }
> >
> > +static inline bool hugepage_global_defer(void)
> > +{
> > +     return transparent_hugepage_flags &
> > +                     (1<<TRANSPARENT_HUGEPAGE_DEFER_PF_INST_FLAG);
> > +}
> > +
> >  static inline int highest_order(unsigned long orders)
> >  {
> >       return fls_long(orders) - 1;
> > @@ -291,13 +299,16 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
> >                                      unsigned long tva_flags,
> >                                      unsigned long orders)
> >  {
> > +     if ((tva_flags & TVA_IN_PF) && hugepage_global_defer() &&
> > +                     !(vm_flags & VM_HUGEPAGE))
> > +             return 0;
> > +
> >       /* Optimization to check if required orders are enabled early. */
> >       if ((tva_flags & TVA_ENFORCE_SYSFS) && vma_is_anonymous(vma)) {
> >               unsigned long mask = READ_ONCE(huge_anon_orders_always);
> > -
>
> This newline should stay, right?
Yes, I can fix that.
>
> The rest looks good to me. Thanks. Acked-by: Zi Yan <ziy@nvidia.com>
Thank you!
-- Nico
>
> Best Regards,
> Yan, Zi
>

Zi Yan April 30, 2025, 8:34 p.m. UTC | #2

On 28 Apr 2025, at 14:29, Nico Pache wrote:

> Now that we have defer to globally disable THPs at fault time, lets add
> a defer setting to the mTHP options. This will allow khugepaged to
> operate at that order, while avoiding it at PF time.
>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  include/linux/huge_mm.h |  5 +++++
>  mm/huge_memory.c        | 38 +++++++++++++++++++++++++++++++++-----
>  mm/khugepaged.c         |  8 ++++----
>  3 files changed, 42 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 57e6c962afb1..a877c59bea67 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -96,6 +96,7 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
>  #define TVA_SMAPS		(1 << 0)	/* Will be used for procfs */
>  #define TVA_IN_PF		(1 << 1)	/* Page fault handler */
>  #define TVA_ENFORCE_SYSFS	(1 << 2)	/* Obey sysfs configuration */
> +#define TVA_IN_KHUGEPAGE	((1 << 2) | (1 << 3)) /* Khugepaged defer support */

Why is TVA_IN_KHUGEPAGE a superset of TVA_ENFORCE_SYSFS? Because khugepaged
also obeys sysfs configuration?

I wonder if explicitly coding the behavior is better. For example,
in __thp_vma_allowable_orders(), enforce_sysfs = tva_flags & (TVA_ENFORCE_SYSFS | TVA_IN_KHUGEPAGE).

>
>  #define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \
>  	(!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
> @@ -182,6 +183,7 @@ extern unsigned long transparent_hugepage_flags;
>  extern unsigned long huge_anon_orders_always;
>  extern unsigned long huge_anon_orders_madvise;
>  extern unsigned long huge_anon_orders_inherit;
> +extern unsigned long huge_anon_orders_defer;
>
>  static inline bool hugepage_global_enabled(void)
>  {
> @@ -306,6 +308,9 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>  	/* Optimization to check if required orders are enabled early. */
>  	if ((tva_flags & TVA_ENFORCE_SYSFS) && vma_is_anonymous(vma)) {

And code here becomes tva_flags & (TVA_ENFORCE_SYSFS | TVA_IN_KHUGEPAGE).

Otherwise, LGTM. Reviewed-by: Zi Yan <ziy@nvidia.com>

--
Best Regards,
Yan, Zi
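
The two encodings discussed above can be compared in a small stand-alone
sketch. The first four defines are copied from the quoted hunk;
TVA_IN_KHUGEPAGE_ONLY and the three helper functions are hypothetical names
used only for illustration.

```c
#include <stdbool.h>

#define TVA_SMAPS             (1 << 0)
#define TVA_IN_PF             (1 << 1)
#define TVA_ENFORCE_SYSFS     (1 << 2)
/* As posted: shares bit 2 with TVA_ENFORCE_SYSFS. */
#define TVA_IN_KHUGEPAGE      ((1 << 2) | (1 << 3))
/* Hypothetical explicit alternative: a bit of its own. */
#define TVA_IN_KHUGEPAGE_ONLY (1 << 3)

/* Superset encoding: the existing sysfs test also fires for khugepaged. */
static bool enforce_sysfs_superset(unsigned long tva_flags)
{
    return (tva_flags & TVA_ENFORCE_SYSFS) != 0;
}

/*
 * Superset encoding: detecting khugepaged needs a full-mask compare,
 * because a plain '&' is also non-zero for TVA_ENFORCE_SYSFS alone.
 */
static bool in_khugepaged_superset(unsigned long tva_flags)
{
    return (tva_flags & TVA_IN_KHUGEPAGE) == TVA_IN_KHUGEPAGE;
}

/* Explicit encoding: the sysfs test names both flags. */
static bool enforce_sysfs_explicit(unsigned long tva_flags)
{
    return (tva_flags & (TVA_ENFORCE_SYSFS | TVA_IN_KHUGEPAGE_ONLY)) != 0;
}
```

The superset form keeps existing TVA_ENFORCE_SYSFS checks unchanged, at the
cost of requiring the full-mask comparison wherever khugepaged must be told
apart from other sysfs-enforcing callers.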

Nico Pache May 1, 2025, 10:53 p.m. UTC | #3

On Wed, Apr 30, 2025 at 2:34 PM Zi Yan <ziy@nvidia.com> wrote:
>
> On 28 Apr 2025, at 14:29, Nico Pache wrote:
>
> > Now that we have defer to globally disable THPs at fault time, lets add
> > a defer setting to the mTHP options. This will allow khugepaged to
> > operate at that order, while avoiding it at PF time.
> >
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >  include/linux/huge_mm.h |  5 +++++
> >  mm/huge_memory.c        | 38 +++++++++++++++++++++++++++++++++-----
> >  mm/khugepaged.c         |  8 ++++----
> >  3 files changed, 42 insertions(+), 9 deletions(-)
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 57e6c962afb1..a877c59bea67 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -96,6 +96,7 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
> >  #define TVA_SMAPS            (1 << 0)        /* Will be used for procfs */
> >  #define TVA_IN_PF            (1 << 1)        /* Page fault handler */
> >  #define TVA_ENFORCE_SYSFS    (1 << 2)        /* Obey sysfs configuration */
> > +#define TVA_IN_KHUGEPAGE     ((1 << 2) | (1 << 3)) /* Khugepaged defer support */
>
> Why is TVA_IN_KHUGEPAGE a superset of TVA_ENFORCE_SYSFS? Because khugepaged
> also obeys sysfs configuration?
Correct! TVA_IN_KHUGEPAGE is needed to keep the "deferred" mTHPs from
being "allowed" unless we are in khugepaged.
>
> I wonder if explicitly coding the behavior is better. For example,
> in __thp_vma_allowable_orders(), enforce_sysfs = tva_flags & (TVA_ENFORCE_SYSFS | TVA_IN_KHUGEPAGE).
I'm rather indifferent about either approach. If you (or any others)
have a strong preference for an explicit (non-superset) TVA flag, I
can make the change.
>
> >
> >  #define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \
> >       (!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
> > @@ -182,6 +183,7 @@ extern unsigned long transparent_hugepage_flags;
> >  extern unsigned long huge_anon_orders_always;
> >  extern unsigned long huge_anon_orders_madvise;
> >  extern unsigned long huge_anon_orders_inherit;
> > +extern unsigned long huge_anon_orders_defer;
> >
> >  static inline bool hugepage_global_enabled(void)
> >  {
> > @@ -306,6 +308,9 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
> >       /* Optimization to check if required orders are enabled early. */
> >       if ((tva_flags & TVA_ENFORCE_SYSFS) && vma_is_anonymous(vma)) {
>
> And code here becomes tva_flags & (TVA_ENFORCE_SYSFS | TVA_IN_KHUGEPAGE).
or just (enforce_sysfs & vma_is_anon), like you mentioned. Then we
check for TVA_IN_KHUGEPAGE before appending the defer bits.
>
> Otherwise, LGTM. Reviewed-by: Zi Yan <ziy@nvidia.com>
Thanks !
>
> --
> Best Regards,
> Yan, Zi
>
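
One way to read "check for TVA_IN_KHUGEPAGE before appending the defer
bits" is sketched below. This is an assumption about the shape of the
logic, not the code from the patch: struct anon_orders and allowed_orders()
are hypothetical, the "inherit" setting is left out, and only the two flag
values are taken from the quoted hunk.

```c
#include <stdbool.h>

#define TVA_ENFORCE_SYSFS (1UL << 2)
#define TVA_IN_KHUGEPAGE  ((1UL << 2) | (1UL << 3))

/* Hypothetical per-order sysfs state, one bit per page order. */
struct anon_orders {
    unsigned long always;
    unsigned long madvise;
    unsigned long defer;
};

/* Return the subset of 'orders' this caller may use (simplified model). */
static unsigned long allowed_orders(const struct anon_orders *cfg,
                                    unsigned long tva_flags,
                                    bool madv_hugepage,
                                    unsigned long orders)
{
    bool enforce_sysfs = (tva_flags & TVA_ENFORCE_SYSFS) != 0;
    bool in_khugepaged = (tva_flags & TVA_IN_KHUGEPAGE) == TVA_IN_KHUGEPAGE;
    unsigned long mask;

    if (!enforce_sysfs)
        return orders;

    mask = cfg->always;
    if (madv_hugepage)
        mask |= cfg->madvise;
    /* Deferred orders become visible only to khugepaged. */
    if (in_khugepaged)
        mask |= cfg->defer;

    return orders & mask;
}
```

In this model, an order set to defer is filtered out for a page fault
caller that passes TVA_ENFORCE_SYSFS, and is kept for a khugepaged caller
that passes TVA_IN_KHUGEPAGE.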