[v2,0/5] implement lightweight guard pages

Message ID cover.1729440856.git.lorenzo.stoakes@oracle.com

Message

Lorenzo Stoakes Oct. 20, 2024, 4:20 p.m. UTC
Userland library functions such as allocators and threading implementations
often require regions of memory to act as 'guard pages' - mappings which,
when accessed, result in a fatal signal being sent to the accessing
process.

The current means by which these are implemented is via a PROT_NONE mmap()
mapping, which provides the required semantics, but incurs the overhead of
a VMA for each such region.

With a great many processes and threads, this can rapidly add up and incur
a significant memory penalty. It also has the added problem of preventing
VMA merges that might otherwise be permitted.

This series takes a different approach - an idea suggested by Vlastimil
Babka (and before him David Hildenbrand and Jann Horn - perhaps more - the
provenance becomes a little tricky to ascertain after this - please forgive
any omissions!) - rather than locating the guard pages at the VMA layer, we
instead place them in the page tables mapping the required ranges.

Early testing of the prototype version of this code suggests a 5x speed-up
in memory mapping invocations (in conjunction with use of
process_madvise()) and a 13% reduction in VMAs on an entirely idle Android
system, with unoptimised code.

We expect that, with optimisation and on a loaded system with a larger
number of guard pages, these gains could increase significantly, but in
any case these numbers are encouraging.

This way, rather than having separate VMAs specifying which parts of a
range are guard pages, we have a single VMA spanning the entire range of
memory a user is permitted to access, including the ranges which are to be
'guarded'.

After mapping this, a user can specify which parts of the range should
result in a fatal signal when accessed.

By restricting the ability to specify guard pages to memory mapped by
existing VMAs, we can rely on the guard markers being torn down when those
mappings are ultimately unmapped, and everything works simply as if the
memory were never faulted in, from the point of view of the containing
VMAs.

This mechanism in effect poisons memory ranges in a manner similar to
hardware memory poisoning, only it is an entirely software-controlled form
of poisoning.

Any poisoned region of memory is also able to be 'unpoisoned', that is, to
have its poison markers removed.

The mechanism is implemented via madvise() behaviour - MADV_GUARD_POISON,
which simply poisons ranges, and MADV_GUARD_UNPOISON, which clears this
poisoning.
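
For instance, a minimal userland sketch (illustrative only - it assumes the
MADV_GUARD_POISON/MADV_GUARD_UNPOISON definitions added by this series are
visible via the updated mman headers, and omits error handling):

#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);

	/* One VMA spanning the entire range, guard page included. */
	char *buf = mmap(NULL, 10 * page, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* Poison the middle page - no additional VMA is created. */
	madvise(buf + 5 * page, page, MADV_GUARD_POISON);

	/* Touching buf[5 * page] here would deliver SIGSEGV. */

	/* Unpoison - the page then faults in as normal zero-fill memory. */
	madvise(buf + 5 * page, page, MADV_GUARD_UNPOISON);
	buf[5 * page] = 'x';	/* fine again */

	munmap(buf, 10 * page);
	return 0;
}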

Poisoning can be performed across multiple VMAs and any existing mappings
will be cleared, that is zapped, before installing the poisoned page table
mappings.

There is no concept of 'nested' poisoning; multiple attempts to poison a
range will, after the first poisoning, have no effect.

Importantly, unpoisoning of poisoned ranges has no effect on non-poisoned
memory, so a user can safely unpoison a range of memory and clear only the
poison page table mappings, leaving the rest intact.

The actual mechanism by which the page table entries are specified makes
use of existing logic - PTE markers, which are used for the userfaultfd
UFFDIO_POISON mechanism.

Unfortunately PTE_MARKER_POISONED is not suited for the guard page
mechanism as it results in VM_FAULT_HWPOISON semantics in the fault
handler, so we add our own specific PTE_MARKER_GUARD and adapt existing
logic to handle it.

We also extend the generic page walk mechanism to allow for installation of
PTEs (carefully restricted to memory management logic only to prevent
unwanted abuse).
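
To sketch the shape of this (illustrative only - the authoritative
signatures and naming are those in patches 1/5 and 3/5), a guard-poisoning
install_pte() handler boils down to:

static int guard_poison_install_pte(unsigned long addr, unsigned long next,
				    pte_t *ptep, struct mm_walk *walk)
{
	/*
	 * Invoked only for pte_none() entries - the page walker itself
	 * performs the set_pte_at() using the value provided here.
	 */
	*ptep = swp_entry_to_pte(make_guard_swp_entry());
	return 0;
}

static const struct mm_walk_ops guard_poison_walk_ops = {
	/*
	 * pud_entry()/pmd_entry() callbacks can set walk->action =
	 * ACTION_CONTINUE to avoid needlessly splitting huge mappings.
	 */
	.install_pte	= guard_poison_install_pte,
};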

We ensure that zapping performed by, for instance, MADV_DONTNEED, does not
remove guard poison markers, nor does forking (except when VM_WIPEONFORK is
specified for a VMA which implies a total removal of memory
characteristics).

It's important to note that the guard page implementation is emphatically
NOT a security feature, so a user can remove the poisoning if they wish. We
simply implement it in such a way as to provide the least surprising
behaviour.

An extensive set of self-tests is provided which ensures behaviour is as
expected and additionally self-documents the expected behaviour of poisoned
ranges.

Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Suggested-by: Jann Horn <jannh@google.com>
Suggested-by: David Hildenbrand <david@redhat.com>

v2
* The macros in kselftest_harness.h seem to be broken - __EXPECT() is
  terminated by '} while (0); OPTIONAL_HANDLER(_assert)' meaning it is not
  safe in single-line if / else or for / while blocks, however working
  around this results in checkpatch producing invalid warnings, as reported
  by Shuah.
* Fixing these macros is out of scope for this series, so compromise and
  instead rewrite test blocks so as to use multiple lines by separating out
  a decl in most cases. This has the side effect of, for the most part,
  making things more readable.
* Heavily document the use of the volatile keyword - we can't avoid
  checkpatch complaining about this, so we explain it, as reported by
  Shuah.
* Updated commit message to highlight that we skip tests we lack
  permissions for, as reported by Shuah.
* Replaced a perror() with ksft_exit_fail_perror(), as reported by Shuah.
* Added user friendly messages to cases where tests are skipped due to lack
  of permissions, as reported by Shuah.
* Update the tool header to include the new MADV_GUARD_POISON/UNPOISON
  defines and directly include asm-generic/mman.h to get the
  platform-neutral versions to ensure we import them.
* Finally fixed Vlastimil's email address in Suggested-by tags from suze to
  suse, as reported by Vlastimil.
* Added linux-api to cc list, as reported by Vlastimil.

v1
* Un-RFC'd, as there appear to be no major objections to the approach, but
  rather debate on implementation.
* Fixed an issue with arches which need the mmu_context.h and tlbflush.h
  header imports in the pagewalker logic to be able to use
  update_mmu_cache(), as reported by the kernel test bot.
* Added comments in page walker logic to clarify who can use
  ops->install_pte and why as well as adding a check_ops_valid() helper
  function, as suggested by Christoph.
* Pass false in full parameter in pte_clear_not_present_full() as suggested
  by Jann.
* Stopped erroneously requiring a write lock for the poison operation as
  suggested by Jann and Suren.
* Moved anon_vma_prepare() to the start of madvise_guard_poison() to be
  consistent with how this is used elsewhere in the kernel as suggested by
  Jann.
* Avoid returning -EAGAIN if we are raced on page faults, just keep looping
  and duck out if a fatal signal is pending or a conditional reschedule is
  needed, as suggested by Jann.
* Avoid needlessly splitting huge PUDs and PMDs by specifying
  ACTION_CONTINUE, as suggested by Jann.
https://lore.kernel.org/all/cover.1729196871.git.lorenzo.stoakes@oracle.com/

RFC
https://lore.kernel.org/all/cover.1727440966.git.lorenzo.stoakes@oracle.com/

Lorenzo Stoakes (5):
  mm: pagewalk: add the ability to install PTEs
  mm: add PTE_MARKER_GUARD PTE marker
  mm: madvise: implement lightweight guard page mechanism
  tools: testing: update tools UAPI header for mman-common.h
  selftests/mm: add self tests for guard page feature

 arch/alpha/include/uapi/asm/mman.h           |    3 +
 arch/mips/include/uapi/asm/mman.h            |    3 +
 arch/parisc/include/uapi/asm/mman.h          |    3 +
 arch/xtensa/include/uapi/asm/mman.h          |    3 +
 include/linux/mm_inline.h                    |    2 +-
 include/linux/pagewalk.h                     |   18 +-
 include/linux/swapops.h                      |   26 +-
 include/uapi/asm-generic/mman-common.h       |    3 +
 mm/hugetlb.c                                 |    3 +
 mm/internal.h                                |    6 +
 mm/madvise.c                                 |  168 +++
 mm/memory.c                                  |   18 +-
 mm/mprotect.c                                |    3 +-
 mm/mseal.c                                   |    1 +
 mm/pagewalk.c                                |  200 ++-
 tools/include/uapi/asm-generic/mman-common.h |    3 +
 tools/testing/selftests/mm/.gitignore        |    1 +
 tools/testing/selftests/mm/Makefile          |    1 +
 tools/testing/selftests/mm/guard-pages.c     | 1228 ++++++++++++++++++
 19 files changed, 1627 insertions(+), 66 deletions(-)
 create mode 100644 tools/testing/selftests/mm/guard-pages.c

--
2.47.0

Comments

Lorenzo Stoakes Oct. 20, 2024, 7:45 p.m. UTC | #1
On Sun, Oct 20, 2024 at 07:37:54PM +0200, Florian Weimer wrote:
> * Lorenzo Stoakes:
>
> > Early testing of the prototype version of this code suggests a 5x speed-up
> > in memory mapping invocations (in conjunction with use of
> > process_madvise()) and a 13% reduction in VMAs on an entirely idle Android
> > system, with unoptimised code.
> >
> > We expect that, with optimisation and on a loaded system with a larger
> > number of guard pages, these gains could increase significantly, but in
> > any case these numbers are encouraging.
> >
> > This way, rather than having separate VMAs specifying which parts of a
> > range are guard pages, we have a single VMA spanning the entire range of
> > memory a user is permitted to access, including the ranges which are to
> > be 'guarded'.
> >
> > After mapping this, a user can specify which parts of the range should
> > result in a fatal signal when accessed.
> >
> > By restricting the ability to specify guard pages to memory mapped by
> > existing VMAs, we can rely on the guard markers being torn down when those
> > mappings are ultimately unmapped, and everything works simply as if the
> > memory were never faulted in, from the point of view of the containing
> > VMAs.
>
> We have a glibc (so not Android) dynamic linker bug that asks us to
> remove PROT_NONE mappings in mapped shared objects:
>
>   Extra struct vm_area_struct with ---p created when PAGE_SIZE < max-page-size
>   <https://sourceware.org/bugzilla/show_bug.cgi?id=31076>
>
> It's slightly different from a guard page because our main goal is to
> avoid other mappings ending up in those gaps, which has been shown to
> cause odd application behavior in cases where it happens.  If I
> understand the series correctly, the kernel would not automatically
> attribute those PROT_NONE gaps to the previous or subsequent mapping.
> We would have to extend one of the surrounding mappings and apply
> MADV_POISON to that over-mapped part.  That doesn't seem too onerous.
>
> Could the ELF loader in the kernel do the same thing for the main
> executable and the program loader?

Currently this implementation is only available for private anonymous
memory. We may look at extending it to shmem and file-backed memory in the
future but there are a whole host of things to overcome to make that work
so it's one step at a time! :)
Lorenzo Stoakes Oct. 21, 2024, 1:50 p.m. UTC | #2
On Mon, Oct 21, 2024 at 03:27:55PM +0200, Vlastimil Babka wrote:
> On 10/20/24 18:20, Lorenzo Stoakes wrote:
> > The existing generic pagewalk logic permits the walking of page tables,
> > invoking callbacks at individual page table levels via user-provided
> > mm_walk_ops callbacks.
> >
> > This is useful for traversing existing page table entries, but precludes
> > the ability to establish new ones.
> >
> > Existing mechanisms for performing a walk which also install page table
> > entries where necessary are heavily duplicated throughout the kernel, each
> > with semantic differences from one another and largely unavailable for use
> > elsewhere.
> >
> > Rather than add yet another implementation, we extend the generic pagewalk
> > logic to enable the installation of page table entries by adding a new
> > install_pte() callback in mm_walk_ops. If this is specified, then upon
> > encountering a missing page table entry, we allocate and install a new one
> > and continue the traversal.
> >
> > If a THP huge page is encountered, we make use of existing logic to split
> > it. Then once we reach the PTE level, we invoke the install_pte() callback
> > which provides a PTE entry to install. We do not support hugetlb at this
> > stage.
> >
> > If this function returns an error, or an allocation fails during the
> > operation, we abort the operation altogether. It is up to the caller to
> > deal appropriately with partially populated page table ranges.
> >
> > If install_pte() is defined, the semantics of pte_entry() change - this
> > callback is then only invoked if the entry already exists. This is a useful
> > property, as it allows a caller to handle existing PTEs while installing
> > new ones where necessary in the specified range.
> >
> > If install_pte() is not defined, then there is no functional difference to
> > this patch, so all existing logic will work precisely as it did before.
> >
> > As we only permit the installation of PTEs where a mapping does not already
> > exist there is no need for TLB management, however we do invoke
> > update_mmu_cache() for architectures which require manual maintenance of
> > mappings for other CPUs.
> >
> > We explicitly do not allow the existing page walk API to expose this
> > feature as it is dangerous and intended for internal mm use only. Therefore
> > we provide a new walk_page_range_mm() function exposed only to
> > mm/internal.h.
> >
> > Reviewed-by: Jann Horn <jannh@google.com>
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> <snip>
>
> >  /*
> >   * We want to know the real level where a entry is located ignoring any
> >   * folding of levels which may be happening. For example if p4d is folded then
> > @@ -29,9 +34,23 @@ static int walk_pte_range_inner(pte_t *pte, unsigned long addr,
> >  	int err = 0;
> >
> >  	for (;;) {
> > -		err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
> > -		if (err)
> > -		       break;
> > +		if (ops->install_pte && pte_none(ptep_get(pte))) {
> > +			pte_t new_pte;
> > +
> > +			err = ops->install_pte(addr, addr + PAGE_SIZE, &new_pte,
> > +					       walk);
> > +			if (err)
> > +				break;
> > +
> > +			set_pte_at(walk->mm, addr, pte, new_pte);
>
> While the guard pages install ptes unconditionally, maybe some install_pte
> handler implementation would sometimes want to skip, should we define an
> error code that means it's skipped and just continue instead of set_pte_at()?
> Or leave it until such a handler appears.

I'm not sure under what circumstances you'd want to skip though precisely?
There's nothing populated, and the user is defining the range in which to
install a PTE if nothing is populated.

If they wanted more complicated handling they could do multiple, smaller
calls. Things are inherently racey with these walks so there's no benefit
in doing everything at once.

>
> > +			/* Non-present before, so for arches that need it. */
> > +			if (!WARN_ON_ONCE(walk->no_vma))
> > +				update_mmu_cache(walk->vma, addr, pte);
> > +		} else {
> > +			err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
> > +			if (err)
> > +				break;
> > +		}
> >  		if (addr >= end - PAGE_SIZE)
> >  			break;
> >  		addr += PAGE_SIZE;
> > @@ -89,11 +108,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> >  again:
> >  		next = pmd_addr_end(addr, end);
> >  		if (pmd_none(*pmd)) {
> > -			if (ops->pte_hole)
> > +			if (ops->install_pte)
> > +				err = __pte_alloc(walk->mm, pmd);
> > +			else if (ops->pte_hole)
> >  				err = ops->pte_hole(addr, next, depth, walk);
> >  			if (err)
> >  				break;
> > -			continue;
> > +			if (!ops->install_pte)
> > +				continue;
> >  		}
> >
> >  		walk->action = ACTION_SUBTREE;
> > @@ -116,7 +138,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> >  		 */
> >  		if ((!walk->vma && (pmd_leaf(*pmd) || !pmd_present(*pmd))) ||
> >  		    walk->action == ACTION_CONTINUE ||
> > -		    !(ops->pte_entry))
> > +		    !(ops->pte_entry || ops->install_pte))
> >  			continue;
>
> BTW, I find it hard to read this condition even before your patch, oh well.

Agreed, this badly needs refactoring, but felt out of scope for this change.

> But if I read it correctly, does it mean we're going to split a pmd-mapped
> THP if we have an install_pte handler? But is that really necessary if the
> pmd splitting results in all ptes populated, and thus the install_pte
> handler can't do anything with any pte anyway?

Yes... however nothing else here in the logic has special handling for
transhuge pages AND there is already an interface provided to prevent this
if you want - which we use in commit 3/5 - that is, to provide pud/pmd
walkers that set walk->action = ACTION_CONTINUE if transhuge.

Having said that, it kind of sucks that we are doing a useless split
here. Hmm. In the pte_entry() case you might very well want to split and do
something with the PTE. With the install you are only interested if it's
non-present...

It's not _completely_ infeasible that a user would want this, but it's very
unlikely.

OK so yeah let's add it and clean up this expression while we're at it, will
fix on respin.

>
> >  		if (walk->vma)
> > @@ -148,11 +170,14 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
> >   again:
> >  		next = pud_addr_end(addr, end);
> >  		if (pud_none(*pud)) {
> > -			if (ops->pte_hole)
> > +			if (ops->install_pte)
> > +				err = __pmd_alloc(walk->mm, pud, addr);
> > +			else if (ops->pte_hole)
> >  				err = ops->pte_hole(addr, next, depth, walk);
> >  			if (err)
> >  				break;
> > -			continue;
> > +			if (!ops->install_pte)
> > +				continue;
> >  		}
> >
> >  		walk->action = ACTION_SUBTREE;
> > @@ -167,7 +192,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
> >
> >  		if ((!walk->vma && (pud_leaf(*pud) || !pud_present(*pud))) ||
> >  		    walk->action == ACTION_CONTINUE ||
> > -		    !(ops->pmd_entry || ops->pte_entry))
> > +		    !(ops->pmd_entry || ops->pte_entry || ops->install_pte))
> >  			continue;
>
> Ditto?
>
Lorenzo Stoakes Oct. 21, 2024, 2:33 p.m. UTC | #3
On Mon, Oct 21, 2024 at 04:13:34PM +0200, Vlastimil Babka wrote:
> On 10/20/24 18:20, Lorenzo Stoakes wrote:
> > Add a new PTE marker that results in any access causing the accessing
> > process to segfault.
> >
> > This is preferable to PTE_MARKER_POISONED, which results in the same
> > handling as hardware poisoned memory, and is thus undesirable for cases
> > where we simply wish to 'soft' poison a range.
> >
> > This is in preparation for implementing the ability to specify guard pages
> > at the page table level, i.e. ranges that, when accessed, should cause
> > process termination.
> >
> > Additionally, rename zap_drop_file_uffd_wp() to zap_drop_markers() - the
> > function checks the ZAP_FLAG_DROP_MARKER flag so naming it for this single
> > purpose was simply incorrect.
> >
> > We then reuse the same logic to determine whether a zap should clear a
> > guard entry - this should only be performed on teardown and never on
> > MADV_DONTNEED or the like.
>
> Since I would have personally put MADV_FREE among "or the like" here, it's
surprising to me that in fact it's tearing down the guard entries now. Is
> that intentional? It should be at least mentioned very explicitly. But I'd
> really argue against it, as MADV_FREE is to me a weaker form of
> MADV_DONTNEED - the existing pages are not zapped immediately but
> prioritized for reclaim. If MADV_DONTNEED leaves guard PTEs in place, why
> shouldn't MADV_FREE too?

That is not, as I understand it, what MADV_FREE is, semantically. From the
man pages:

       MADV_FREE (since Linux 4.5)

              The application no longer requires the pages in the range
              specified by addr and len.  The kernel can thus free these
              pages, but the freeing could be delayed until memory pressure
              occurs.

       MADV_DONTNEED

              Do not expect access in the near future.  (For the time
              being, the application is finished with the given range, so
              the kernel can free resources associated with it.)

MADV_FREE is 'we are completely done with this range'. MADV_DONTNEED is 'we
don't expect to use it in the near future'.

>
> Seems to me rather currently an artifact of MADV_FREE implementation - if it
> encounters hwpoison entries it will tear them down because why not, we have
> detected a hw memory error and are lucky the program wants to discard the
> pages and not access them, so best use the opportunity and get rid of the
> PTE entries immediately (if MADV_DONTNEED doesn't do that too, it certainly
> could).

Right, but we explicitly do not tear them down in the case of MADV_DONTNEED
which matches the description in the manpages that the user _might_ come
back to the range, whereas MADV_FREE means they are truly done but just
don't want the overhead of actually unmapping at this point.

Seems to me this is more so that MADV_FREE is a not-really-as-efficient
version of what Rik wants to do with his MADV_LAZYFREE thing.

>
> But to extend this to guard PTEs which are the result of an explicit userspace
> action feels wrong, unless the semantics is the same for MADV_DONTNEED. The
> semantics chosen for MADV_DONTNEED makes sense, so MADV_FREE should behave
> the same?

My understanding from the above is that MADV_FREE is a softer version of
munmap(), i.e. 'totally done with this range', whereas MADV_DONTNEED is a
'revert state to when I first mapped this stuff because I'm done with it
for now but might use it later'.

Obviously happy to be corrected if my understanding of this is off-the-mark
but this is what motivated me to do it this way!

>
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
> >  include/linux/mm_inline.h |  2 +-
> >  include/linux/swapops.h   | 26 ++++++++++++++++++++++++--
> >  mm/hugetlb.c              |  3 +++
> >  mm/memory.c               | 18 +++++++++++++++---
> >  4 files changed, 43 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> > index 355cf46a01a6..1b6a917fffa4 100644
> > --- a/include/linux/mm_inline.h
> > +++ b/include/linux/mm_inline.h
> > @@ -544,7 +544,7 @@ static inline pte_marker copy_pte_marker(
> >  {
> >  	pte_marker srcm = pte_marker_get(entry);
> >  	/* Always copy error entries. */
> > -	pte_marker dstm = srcm & PTE_MARKER_POISONED;
> > +	pte_marker dstm = srcm & (PTE_MARKER_POISONED | PTE_MARKER_GUARD);
> >
> >  	/* Only copy PTE markers if UFFD register matches. */
> >  	if ((srcm & PTE_MARKER_UFFD_WP) && userfaultfd_wp(dst_vma))
> > diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> > index cb468e418ea1..4d0606df0791 100644
> > --- a/include/linux/swapops.h
> > +++ b/include/linux/swapops.h
> > @@ -426,9 +426,15 @@ typedef unsigned long pte_marker;
> >   * "Poisoned" here is meant in the very general sense of "future accesses are
> >   * invalid", instead of referring very specifically to hardware memory errors.
> >   * This marker is meant to represent any of various different causes of this.
> > + *
> > + * Note that, when encountered by the faulting logic, PTEs with this marker will
> > + * result in VM_FAULT_HWPOISON and thus regardless trigger hardware memory error
> > + * logic.
> >   */
> >  #define  PTE_MARKER_POISONED			BIT(1)
> > -#define  PTE_MARKER_MASK			(BIT(2) - 1)
> > +/* Indicates that, on fault, this PTE will cause a SIGSEGV signal to be sent. */
> > +#define  PTE_MARKER_GUARD			BIT(2)
> > +#define  PTE_MARKER_MASK			(BIT(3) - 1)
> >
> >  static inline swp_entry_t make_pte_marker_entry(pte_marker marker)
> >  {
> > @@ -461,9 +467,25 @@ static inline swp_entry_t make_poisoned_swp_entry(void)
> >  }
> >
> >  static inline int is_poisoned_swp_entry(swp_entry_t entry)
> > +{
> > +	/*
> > +	 * We treat guard pages as poisoned too as these have the same semantics
> > +	 * as poisoned ranges, only with different fault handling.
> > +	 */
> > +	return is_pte_marker_entry(entry) &&
> > +		(pte_marker_get(entry) &
> > +		 (PTE_MARKER_POISONED | PTE_MARKER_GUARD));
> > +}
> > +
> > +static inline swp_entry_t make_guard_swp_entry(void)
> > +{
> > +	return make_pte_marker_entry(PTE_MARKER_GUARD);
> > +}
> > +
> > +static inline int is_guard_swp_entry(swp_entry_t entry)
> >  {
> >  	return is_pte_marker_entry(entry) &&
> > -	    (pte_marker_get(entry) & PTE_MARKER_POISONED);
> > +		(pte_marker_get(entry) & PTE_MARKER_GUARD);
> >  }
> >
> >  /*
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 906294ac85dc..50e3f6ed73ac 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -6353,6 +6353,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  				ret = VM_FAULT_HWPOISON_LARGE |
> >  				      VM_FAULT_SET_HINDEX(hstate_index(h));
> >  				goto out_mutex;
> > +			} else if (marker & PTE_MARKER_GUARD) {
> > +				ret = VM_FAULT_SIGSEGV;
> > +				goto out_mutex;
> >  			}
> >  		}
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 0f614523b9f4..551455cd453f 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1455,7 +1455,7 @@ static inline bool should_zap_folio(struct zap_details *details,
> >  	return !folio_test_anon(folio);
> >  }
> >
> > -static inline bool zap_drop_file_uffd_wp(struct zap_details *details)
> > +static inline bool zap_drop_markers(struct zap_details *details)
> >  {
> >  	if (!details)
> >  		return false;
> > @@ -1476,7 +1476,7 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
> >  	if (vma_is_anonymous(vma))
> >  		return;
> >
> > -	if (zap_drop_file_uffd_wp(details))
> > +	if (zap_drop_markers(details))
> >  		return;
> >
> >  	for (;;) {
> > @@ -1671,7 +1671,15 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> >  			 * drop the marker if explicitly requested.
> >  			 */
> >  			if (!vma_is_anonymous(vma) &&
> > -			    !zap_drop_file_uffd_wp(details))
> > +			    !zap_drop_markers(details))
> > +				continue;
> > +		} else if (is_guard_swp_entry(entry)) {
> > +			/*
> > +			 * Ordinary zapping should not remove guard PTE
> > +			 * markers. Only do so if we should remove PTE markers
> > +			 * in general.
> > +			 */
> > +			if (!zap_drop_markers(details))
> >  				continue;
> >  		} else if (is_hwpoison_entry(entry) ||
> >  			   is_poisoned_swp_entry(entry)) {
> > @@ -4003,6 +4011,10 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
> >  	if (marker & PTE_MARKER_POISONED)
> >  		return VM_FAULT_HWPOISON;
> >
> > +	/* Hitting a guard page is always a fatal condition. */
> > +	if (marker & PTE_MARKER_GUARD)
> > +		return VM_FAULT_SIGSEGV;
> > +
> >  	if (pte_marker_entry_uffd_wp(entry))
> >  		return pte_marker_handle_uffd_wp(vmf);
> >
>
Vlastimil Babka Oct. 21, 2024, 2:54 p.m. UTC | #4
On 10/21/24 16:33, Lorenzo Stoakes wrote:
> On Mon, Oct 21, 2024 at 04:13:34PM +0200, Vlastimil Babka wrote:
>> On 10/20/24 18:20, Lorenzo Stoakes wrote:
>> > Add a new PTE marker that results in any access causing the accessing
>> > process to segfault.
>> >
>> > This is preferable to PTE_MARKER_POISONED, which results in the same
>> > handling as hardware poisoned memory, and is thus undesirable for cases
>> > where we simply wish to 'soft' poison a range.
>> >
>> > This is in preparation for implementing the ability to specify guard pages
>> > at the page table level, i.e. ranges that, when accessed, should cause
>> > process termination.
>> >
>> > Additionally, rename zap_drop_file_uffd_wp() to zap_drop_markers() - the
>> > function checks the ZAP_FLAG_DROP_MARKER flag so naming it for this single
>> > purpose was simply incorrect.
>> >
>> > We then reuse the same logic to determine whether a zap should clear a
>> > guard entry - this should only be performed on teardown and never on
>> > MADV_DONTNEED or the like.
>>
>> Since I would have personally put MADV_FREE among "or the like" here, it's
>> surprising to me that in fact it's tearing down the guard entries now. Is
>> that intentional? It should be at least mentioned very explicitly. But I'd
>> really argue against it, as MADV_FREE is to me a weaker form of
>> MADV_DONTNEED - the existing pages are not zapped immediately but
>> prioritized for reclaim. If MADV_DONTNEED leaves guard PTEs in place, why
>> shouldn't MADV_FREE too?
> 
> That is not, as I understand it, what MADV_FREE is, semantically. From the
> man pages:
> 
>        MADV_FREE (since Linux 4.5)
> 
>               The application no longer requires the pages in the range
>               specified by addr and len.  The kernel can thus free these
>               pages, but the freeing could be delayed until memory pressure
>               occurs.
> 
>        MADV_DONTNEED
> 
>               Do not expect access in the near future.  (For the time
>               being, the application is finished with the given range, so
>               the kernel can free resources associated with it.)
> 
> MADV_FREE is 'we are completely done with this range'. MADV_DONTNEED is 'we
> don't expect to use it in the near future'.

I think the description gives a wrong impression. What I think matters is
what happens (limited to anon private case as MADV_FREE doesn't support any
other)

MADV_DONTNEED - pages discarded immediately, further access gives new
zero-filled pages

MADV_FREE - pages prioritized for discarding, if that happens before next
write, it gets zero-filled page on next access, but a write done soon enough
 can cancel the upcoming discard.

In that sense, MADV_FREE is a weaker form of DONTNEED, no?

>>
>> Seems to me rather currently an artifact of MADV_FREE implementation - if it
>> encounters hwpoison entries it will tear them down because why not, we have
>> detected a hw memory error and are lucky the program wants to discard the
>> pages and not access them, so best use the opportunity and get rid of the
>> PTE entries immediately (if MADV_DONTNEED doesn't do that too, it certainly
>> could).
> 
> Right, but we explicitly do not tear them down in the case of MADV_DONTNEED
> which matches the description in the manpages that the user _might_ come
> back to the range, whereas MADV_FREE means they are truly done but just
> don't want the overhead of actually unmapping at this point.

But it's also defined what happens if the user comes back to the range after a
MADV_FREE. I think the overhead saved happens in the case of actually coming
back soon enough to prevent the discard. With MADV_DONTNEED it's immediate
and unconditional.

> Seems to me this is more so that MADV_FREE is a not-really-as-efficient
> version of what Rik wants to do with his MADV_LAZYFREE thing.

I think that further optimizes MADV_FREE, which is already more optimized
than MADV_DONTNEED.

>>
>> But to extend this to guard PTEs which are result of an explicit userspace
>> action feels wrong, unless the semantics is the same for MADV_DONTEED. The
>> semantics chosen for MADV_DONTNEED makes sense, so MADV_FREE should behave
>> the same?
> 
> My understanding from the above is that MADV_FREE is a softer version of
> munmap(), i.e. 'totally done with this range', whereas MADV_DONTNEED is a
> 'revert state to when I first mapped this stuff because I'm done with it
> for now but might use it later'.
Lorenzo Stoakes Oct. 21, 2024, 3:33 p.m. UTC | #5
On Mon, Oct 21, 2024 at 04:54:06PM +0200, Vlastimil Babka wrote:
> On 10/21/24 16:33, Lorenzo Stoakes wrote:
> > On Mon, Oct 21, 2024 at 04:13:34PM +0200, Vlastimil Babka wrote:
> >> On 10/20/24 18:20, Lorenzo Stoakes wrote:
> >> > Add a new PTE marker that results in any access causing the accessing
> >> > process to segfault.
> >> >
> >> > This is preferable to PTE_MARKER_POISONED, which results in the same
> >> > handling as hardware poisoned memory, and is thus undesirable for cases
> >> > where we simply wish to 'soft' poison a range.
> >> >
> >> > This is in preparation for implementing the ability to specify guard pages
> >> > at the page table level, i.e. ranges that, when accessed, should cause
> >> > process termination.
> >> >
> >> > Additionally, rename zap_drop_file_uffd_wp() to zap_drop_markers() - the
> >> > function checks the ZAP_FLAG_DROP_MARKER flag so naming it for this single
> >> > purpose was simply incorrect.
> >> >
> >> > We then reuse the same logic to determine whether a zap should clear a
> >> > guard entry - this should only be performed on teardown and never on
> >> > MADV_DONTNEED or the like.
> >>
> >> Since I would have personally put MADV_FREE among "or the like" here, it's
> >> surprising to me that it in fact it's tearing down the guard entries now. Is
> >> that intentional? It should be at least mentioned very explicitly. But I'd
> >> really argue against it, as MADV_FREE is to me a weaker form of
> >> MADV_DONTNEED - the existing pages are not zapped immediately but
> >> prioritized for reclaim. If MADV_DONTNEED leaves guard PTEs in place, why
> >> shouldn't MADV_FREE too?
> >
> > That is not, as I understand it, what MADV_FREE is, semantically. From the
> > man pages:
> >
> >        MADV_FREE (since Linux 4.5)
> >
> >               The application no longer requires the pages in the range
> >               specified by addr and len.  The kernel can thus free these
> >               pages, but the freeing could be delayed until memory pressure
> >               occurs.
> >
> >        MADV_DONTNEED
> >
> >               Do not expect access in the near future.  (For the time
> >               being, the application is finished with the given range, so
> >               the kernel can free resources associated with it.)
> >
> > MADV_FREE is 'we are completely done with this range'. MADV_DONTNEED is 'we
> > don't expect to use it in the near future'.
>
> I think the description gives a wrong impression. What I think matters is
> what happens (limited to anon private case as MADV_FREE doesn't support any
> other)
>
> MADV_DONTNEED - pages discarded immediately, further access gives new
> zero-filled pages
>
> MADV_FREE - pages prioritized for discarding, if that happens before next
> write, it gets zero-filled page on next access, but a write done soon enough
>  can cancel the upcoming discard.
>
> In that sense, MADV_FREE is a weaker form of DONTNEED, no?
>
> >>
> >> Seems to me rather currently an artifact of MADV_FREE implementation - if it
> >> encounters hwpoison entries it will tear them down because why not, we have
> >> detected a hw memory error and are lucky the program wants to discard the
> >> pages and not access them, so best use the opportunity and get rid of the
> >> PTE entries immediately (if MADV_DONTNEED doesn't do that too, it certainly
> >> could).
> >
> > Right, but we explicitly do not tear them down in the case of MADV_DONTNEED
> > which matches the description in the manpages that the user _might_ come
> > back to the range, whereas MADV_FREE means they are truly done but just
> > don't want the overhead of actually unmapping at this point.
>
> But it's also defined what happens if the user comes back to the range after a
> MADV_FREE. I think the overhead saved happens in the case of actually coming
> back soon enough to prevent the discard. With MADV_DONTNEED it's immediate
> and unconditional.
>
> > Seems to me this is more so that MADV_FREE is a not-really-as-efficient
> > version of what Rik wants to do with his MADV_LAZYFREE thing.
>
> I think that further optimizes MADV_FREE, which is already more optimized
> than MADV_DONTNEED.
>
> >>
> >> But to extend this to guard PTEs which are the result of an explicit userspace
> >> action feels wrong, unless the semantics is the same for MADV_DONTNEED. The
> >> semantics chosen for MADV_DONTNEED makes sense, so MADV_FREE should behave
> >> the same?
> >
> > My understanding from the above is that MADV_FREE is a softer version of
> > munmap(), i.e. 'totally done with this range', whereas MADV_DONTNEED is a
> > 'revert state to when I first mapped this stuff because I'm done with it
> > for now but might use it later'.
>
> From the implementation I get the opposite understanding. Neither tears down
> the vma like a proper munmap(). MADV_DONTNEED zaps page tables immediately,
> MADV_FREE effectively too but with a delay depending on memory pressure.
>

OK so based on IRC chat I think the conclusion here is TL;DR yes we have to
change this, you're right :)

To summarise for on-list:

* MADV_FREE, while ostensibly being a 'lazy free' mechanism, has the
  ability to be 'cancelled' if you write to the memory. Also, after the
  freeing is complete, you can write to the memory to reuse it, the mapping
  is still there.

* For hardware poison markers it makes sense to drop them as you're
  effectively saying 'I am done with this range that is now unbacked and
  expect to get an empty page should I use it now'. UFFD WP I am not sure
  about but presumably also fine.

* However, guard pages are different - if you 'cancel' and you are left
  with a block of memory allocated to you by a pthread or userland
  allocator implementation, you don't want to then no longer be protected
  from overrunning into other thread memory.

So, while I implemented this based on a. the semantics as apparently
expressed in the man page and b. existing PTE marker behaviour, it seems
that it would actually be problematic to do so.

So yeah, I'll change that in v3!
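
To illustrate the 'cancellation' property from the first bullet, a minimal
userland sketch (illustrative only, not from the series; error handling
omitted):

#include <sys/mman.h>

int main(void)
{
	/* Assume 4KiB pages for brevity. */
	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	p[0] = 'a';			/* back the page with data */
	madvise(p, 4096, MADV_FREE);	/* page becomes a reclaim candidate */
	p[0] = 'b';			/* writing before reclaim discards the
					 * page cancels the pending free; if
					 * reclaim had already won the race,
					 * this write simply dirties a fresh
					 * zero-filled page */
	return 0;
}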
Lorenzo Stoakes Oct. 21, 2024, 3:41 p.m. UTC | #6
On Mon, Oct 21, 2024 at 04:33:19PM +0100, Lorenzo Stoakes wrote:
[snip]


> * For hardware poison markers it makes sense to drop them as you're
>   effectively saying 'I am done with this range that is now unbacked and
>   expect to get an empty page should I use it now'. UFFD WP I am not sure
>   about but presumably also fine.

[snip]

CORRECTION: (sorry multitasking atm now I have a non-melted CPU) UFFD WP is
            safe from MADV_FREE as it checks is_poisoned_swp_entry() which
            this patch updates to include guard pages, a change I will undo
            in v3. So disregard comment on UFFD WP here.
David Hildenbrand Oct. 21, 2024, 4 p.m. UTC | #7
On 21.10.24 17:33, Lorenzo Stoakes wrote:
> On Mon, Oct 21, 2024 at 04:54:06PM +0200, Vlastimil Babka wrote:
>> On 10/21/24 16:33, Lorenzo Stoakes wrote:
>>> On Mon, Oct 21, 2024 at 04:13:34PM +0200, Vlastimil Babka wrote:
>>>> On 10/20/24 18:20, Lorenzo Stoakes wrote:
>>>>> Add a new PTE marker that results in any access causing the accessing
>>>>> process to segfault.
>>>>>
>>>>> This is preferable to PTE_MARKER_POISONED, which results in the same
>>>>> handling as hardware poisoned memory, and is thus undesirable for cases
>>>>> where we simply wish to 'soft' poison a range.
>>>>>
>>>>> This is in preparation for implementing the ability to specify guard pages
>>>>> at the page table level, i.e. ranges that, when accessed, should cause
>>>>> process termination.
>>>>>
>>>>> Additionally, rename zap_drop_file_uffd_wp() to zap_drop_markers() - the
>>>>> function checks the ZAP_FLAG_DROP_MARKER flag so naming it for this single
>>>>> purpose was simply incorrect.
>>>>>
>>>>> We then reuse the same logic to determine whether a zap should clear a
>>>>> guard entry - this should only be performed on teardown and never on
>>>>> MADV_DONTNEED or the like.
>>>>
>>>> Since I would have personally put MADV_FREE among "or the like" here, it's
>>>> surprising to me that in fact it's tearing down the guard entries now. Is
>>>> that intentional? It should be at least mentioned very explicitly. But I'd
>>>> really argue against it, as MADV_FREE is to me a weaker form of
>>>> MADV_DONTNEED - the existing pages are not zapped immediately but
>>>> prioritized for reclaim. If MADV_DONTNEED leaves guard PTEs in place, why
>>>> shouldn't MADV_FREE too?
>>>
>>> That is not, as I understand it, what MADV_FREE is, semantically. From the
>>> man pages:
>>>
>>>         MADV_FREE (since Linux 4.5)
>>>
>>>                The application no longer requires the pages in the range
>>>                specified by addr and len.  The kernel can thus free these
>>>                pages, but the freeing could be delayed until memory pressure
>>>                occurs.
>>>
>>>         MADV_DONTNEED
>>>
>>>                Do not expect access in the near future.  (For the time
>>>                being, the application is finished with the given range, so
>>>                the kernel can free resources associated with it.)
>>>
>>> MADV_FREE is 'we are completely done with this range'. MADV_DONTNEED is 'we
>>> don't expect to use it in the near future'.
>>
>> I think the description gives a wrong impression. What I think matters is
>> what happens (limited to anon private case as MADV_FREE doesn't support any
>> other)
>>
>> MADV_DONTNEED - pages discarded immediately, further access gives new
>> zero-filled pages
>>
>> MADV_FREE - pages prioritized for discarding, if that happens before next
>> write, it gets zero-filled page on next access, but a write done soon enough
>>   can cancel the upcoming discard.
>>
>> In that sense, MADV_FREE is a weaker form of DONTNEED, no?
>>
>>>>
>>>> Seems to me rather currently an artifact of MADV_FREE implementation - if it
>>>> encounters hwpoison entries it will tear them down because why not, we have
>>>> detected a hw memory error and are lucky the program wants to discard the
>>>> pages and not access them, so best use the opportunity and get rid of the
>>>> PTE entries immediately (if MADV_DONTNEED doesn't do that too, it certainly
>>>> could).
>>>
>>> Right, but we explicitly do not tear them down in the case of MADV_DONTNEED
>>> which matches the description in the manpages that the user _might_ come
>>> back to the range, whereas MADV_FREE means they are truly done but just
>>> don't want the overhead of actually unmapping at this point.
>>
>> But it's also defined what happens if the user comes back to the range after a
>> MADV_FREE. I think the overhead saved happens in the case of actually coming
>> back soon enough to prevent the discard. With MADV_DONTNEED it's immediate
>> and unconditional.
>>
>>> Seems to me this is more so that MADV_FREE is a not-really-as-efficient
>>> version of what Rik wants to do with his MADV_LAZYFREE thing.
>>
>> I think that further optimizes MADV_FREE, which is already more optimized
>> than MADV_DONTNEED.
>>
>>>>
>>>> But to extend this to guard PTEs which are the result of an explicit userspace
>>>> action feels wrong, unless the semantics is the same for MADV_DONTNEED. The
>>>> semantics chosen for MADV_DONTNEED makes sense, so MADV_FREE should behave
>>>> the same?
>>>
>>> My understanding from the above is that MADV_FREE is a softer version of
>>> munmap(), i.e. 'totally done with this range', whereas MADV_DONTNEED is a
>>> 'revert state to when I first mapped this stuff because I'm done with it
>>> for now but might use it later'.
>>
>>  From the implementation I get the opposite understanding. Neither tears down
>> the vma like a proper munmap(). MADV_DONTNEED zaps page tables immediately,
>> MADV_FREE effectively too but with a delay depending on memory pressure.
>>
> 
> OK so based on IRC chat I think the conclusion here is TL;DR yes we have to
> change this, you're right :)
> 
> To summarise for on-list:
> 
> * MADV_FREE, while ostensibly being a 'lazy free' mechanism, has the
>    ability to be 'cancelled' if you write to the memory. Also, after the
>    freeing is complete, you can write to the memory to reuse it, the mapping
>    is still there.
> 
> * For hardware poison markers it makes sense to drop them as you're
>    effectively saying 'I am done with this range that is now unbacked and
>    expect to get an empty page should I use it now'. UFFD WP I am not sure
>    about but presumably also fine.
> 
> * However, guard pages are different - if you 'cancel' and you are left
>    with a block of memory allocated to you by a pthread or userland
>    allocator implementation, you don't want to then no longer be protected
>    from overrunning into other thread memory.

Agreed. What happens on MADV_DONTNEED/MADV_FREE on guard pages? Ignored 
or error? It sounds like a usage "error" to me (in contrast to munmap()).
Lorenzo Stoakes Oct. 21, 2024, 4:23 p.m. UTC | #8
On Mon, Oct 21, 2024 at 06:00:20PM +0200, David Hildenbrand wrote:
[snip]
> >
> > To summarise for on-list:
> >
> > * MADV_FREE, while ostensibly being a 'lazy free' mechanism, has the
> >    ability to be 'cancelled' if you write to the memory. Also, after the
> >    freeing is complete, you can write to the memory to reuse it, the mapping
> >    is still there.
> >
> > * For hardware poison markers it makes sense to drop them as you're
> >    effectively saying 'I am done with this range that is now unbacked and
> >    expect to get an empty page should I use it now'. UFFD WP I am not sure
> >    about but presumably also fine.
> >
> > * However, guard pages are different - if you 'cancel' and you are left
> >    with a block of memory allocated to you by a pthread or userland
> >    allocator implementation, you don't want to then no longer be protected
> >    from overrunning into other thread memory.
>
> Agreed. What happens on MADV_DONTNEED/MADV_FREE on guard pages? Ignored or
> error? It sounds like a usage "error" to me (in contrast to munmap()).

It's ignored, no error. On MADV_DONTNEED we already left the guard pages in
place, from v3 we will do the same for MADV_FREE.

I'm not sure I'd say it's an error per se, as somebody might have a use case
where they want to zap over a range but keep guard pages, perhaps an allocator
or something?

Also the existing logic is that existing markers (HW poison, uffd-simulated HW
poison, uffd wp marker) are retained and no error raised on MADV_DONTNEED, and
no error on MADV_FREE either, so it'd be consistent with existing behaviour.

Also semantically you are achieving what the calls expect: you are freeing the
ranges, since the guard page regions are unbacked so are already freed... so yeah
I don't think an error really makes sense here.
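
That is, roughly (sketch only, using this series' MADV_GUARD_* names; error
handling omitted):

#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	char *buf = mmap(NULL, 10 * page, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	madvise(buf + 5 * page, page, MADV_GUARD_POISON);

	/*
	 * Zap the whole range: succeeds with no error, populated pages
	 * become zero-fill-on-demand again...
	 */
	madvise(buf, 10 * page, MADV_DONTNEED);

	/*
	 * ...but the guard marker is retained - touching buf[5 * page]
	 * still delivers SIGSEGV until it is unpoisoned or the mapping
	 * is munmap()ed.
	 */
	return 0;
}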

We might also be limiting use cases by assuming they might _only_ be used for
allocators and such.

>
> --
> Cheers,
>
> David / dhildenb
>
David Hildenbrand Oct. 21, 2024, 4:44 p.m. UTC | #9
On 21.10.24 18:23, Lorenzo Stoakes wrote:
> On Mon, Oct 21, 2024 at 06:00:20PM +0200, David Hildenbrand wrote:
> [snip]
>>>
>>> To summarise for on-list:
>>>
>>> * MADV_FREE, while ostensibly being a 'lazy free' mechanism, has the
>>>     ability to be 'cancelled' if you write to the memory. Also, after the
>>>     freeing is complete, you can write to the memory to reuse it, the mapping
>>>     is still there.
>>>
>>> * For hardware poison markers it makes sense to drop them as you're
>>>     effectively saying 'I am done with this range that is now unbacked and
>>>     expect to get an empty page should I use it now'. UFFD WP I am not sure
>>>     about but presumably also fine.
>>>
>>> * However, guard pages are different - if you 'cancel' and you are left
>>>     with a block of memory allocated to you by a pthread or userland
>>>     allocator implementation, you don't want to then no longer be protected
>>>     from overrunning into other thread memory.
>>
>> Agreed. What happens on MADV_DONTNEED/MADV_FREE on guard pages? Ignored or
>> error? It sounds like a usage "error" to me (in contrast to munmap()).
> 
> It's ignored, no error. On MADV_DONTNEED we already left the guard pages in
> place, from v3 we will do the same for MADV_FREE.
> 
> I'm not sure I'd say it's an error per se, as somebody might have a use case
> where they want to zap over a range but keep guard pages, perhaps an allocator
> or something?

Hm, not sure I see use for that.

Staring at madvise_walk_vmas(), we return ENOMEM on VMA holes, but would 
process PROT_NONE. So current behavior is at least consistent with 
PROT_NONE handling (where something could be mapped, though).

No strong opinion.

> 
> Also the existing logic is that existing markers (HW poison, uffd-simulated HW
> poison, uffd wp marker) are retained and no error raised on MADV_DONTNEED, and
> no error on MADV_FREE either, so it'd be consistent with existing behaviour.


HW poison / uffd-simulated HW poison are expected to be zapped: it's 
just like a mapped page with HWPOISON. So that is correct.
	
UFFD-WP behavior is ... weird. Would not expect MADV_DONTNEED to zap 
uffd-wp entries.

> 
> Also semantically you are achieving what the calls expect: you are freeing the
> ranges, since the guard page regions are unbacked so are already freed... so yeah
> I don't think an error really makes sense here.

If you compare it to a VMA hole, it makes sense to fail. If we treat it
like PROT_NONE, it makes sense to skip them.

> 
> We might also be limiting use cases by assuming they might _only_ be used for
> allocators and such.

I don't buy that as an argument, sorry :)

"Let's map the kernel writable into all user space because otherwise we 
might be limiting use cases"


:P
Lorenzo Stoakes Oct. 21, 2024, 4:51 p.m. UTC | #10
On Mon, Oct 21, 2024 at 06:44:04PM +0200, David Hildenbrand wrote:
> On 21.10.24 18:23, Lorenzo Stoakes wrote:
> > On Mon, Oct 21, 2024 at 06:00:20PM +0200, David Hildenbrand wrote:
> > [snip]
> > > >
> > > > To summarise for on-list:
> > > >
> > > > * MADV_FREE, while ostensibly being a 'lazy free' mechanism, has the
> > > >     ability to be 'cancelled' if you write to the memory. Also, after the
> > > >     freeing is complete, you can write to the memory to reuse it, the mapping
> > > >     is still there.
> > > >
> > > > * For hardware poison markers it makes sense to drop them as you're
> > > >     effectively saying 'I am done with this range that is now unbacked and
> > > >     expect to get an empty page should I use it now'. UFFD WP I am not sure
> > > >     about but presumably also fine.
> > > >
> > > > * However, guard pages are different - if you 'cancel' and you are left
> > > >     with a block of memory allocated to you by a pthread or userland
> > > >     allocator implementation, you don't want to then no longer be protected
> > > >     from overrunning into other thread memory.
> > >
> > > Agreed. What happens on MADV_DONTNEED/MADV_FREE on guard pages? Ignored or
> > > error? It sounds like a usage "error" to me (in contrast to munmap()).
> >
> > It's ignored, no error. On MADV_DONTNEED we already left the guard pages in
> > place, from v3 we will do the same for MADV_FREE.
> >
> > I'm not sure I'd say it's an error per se, as somebody might have a use case
> > where they want to zap over a range but keep guard pages, perhaps an allocator
> > or something?
>
> Hm, not sure I see use for that.
>
> Staring at madvise_walk_vmas(), we return ENOMEM on VMA holes, but would
> process PROT_NONE. So current behavior is at least consistent with PROT_NONE
> handling (where something could be mapped, though).

Err, the handling of holes is terrible, yes we return ENOMEM, but we _carry out
the whole procedure_ then return an error, an error _indistinguishable from an
error arising from any of the individual parts_.

Which is just, awful.

>
> No strong opinion.

Well you used up your strong opinion on the naming ;)

>
> >
> > Also the existing logic is that existing markers (HW poison, uffd-simulated HW
> > poison, uffd wp marker) are retained and no error raised on MADV_DONTNEED, and
> > no error on MADV_FREE either, so it'd be consistent with existing behaviour.
>
>
> HW poison / uffd-simulated HW poison are expected to be zapped: it's just
> like a mapped page with HWPOISON. So that is correct.

Well, poison is _not_ zapped on MADV_DONTNEED but _is_ on MADV_FREE :) anyway, I
mean the MADV flags are a confusing mess generally, as per Vlasta's comments
which to begin with I strongly disagreed with, then, discussing further, realised
that no, this is just a bit insane and had driven _me_ insane.

>
> UFFD-WP behavior is ... weird. Would not expect MADV_DONTNEED to zap uffd-wp
> entries.
>
> >
> > Also semantically you are achieving what the calls expect: you are freeing the
> > ranges, since the guard page regions are unbacked so are already freed... so yeah
> > I don't think an error really makes sense here.
>
> If you compare it to a VMA hole, it makes sense to fail. If we treat it like
> PROT_NONE, it makes sense to skip them.
>
> >
> > We might also be limiting use cases by assuming they might _only_ be used for
> > allocators and such.
>
> I don't buy that as an argument, sorry :)
>
> "Let's map the kernel writable into all user space because otherwise we
> might be limiting use cases"

That's a great idea! Patch series incoming, 1st April 2025... :>)
>
>
> :P
>
> --
> Cheers,
>
> David / dhildenb
>

Overall I think just always leaving the markers in place except on remedy, err,
sorry, sorry, unpoison and munmap, and not returning an error if encountered
elsewhere (other than, of course, GUP) is the right way forward and most in
line with user expectation and practical usage.
David Hildenbrand Oct. 21, 2024, 5 p.m. UTC | #11
On 21.10.24 18:51, Lorenzo Stoakes wrote:
> On Mon, Oct 21, 2024 at 06:44:04PM +0200, David Hildenbrand wrote:
>> On 21.10.24 18:23, Lorenzo Stoakes wrote:
>>> On Mon, Oct 21, 2024 at 06:00:20PM +0200, David Hildenbrand wrote:
>>> [snip]
>>>>>
>>>>> To summarise for on-list:
>>>>>
>>>>> * MADV_FREE, while ostensibly being a 'lazy free' mechanism, has the
>>>>>      ability to be 'cancelled' if you write to the memory. Also, after the
>>>>>      freeing is complete, you can write to the memory to reuse it, the mapping
>>>>>      is still there.
>>>>>
>>>>> * For hardware poison markers it makes sense to drop them as you're
>>>>>      effectively saying 'I am done with this range that is now unbacked and
>>>>>      expect to get an empty page should I use it now'. UFFD WP I am not sure
>>>>>      about but presumably also fine.
>>>>>
>>>>> * However, guard pages are different - if you 'cancel' and you are left
>>>>>      with a block of memory allocated to you by a pthread or userland
>>>>>      allocator implementation, you don't want to then no longer be protected
>>>>>      from overrunning into other thread memory.
>>>>
>>>> Agreed. What happens on MADV_DONTNEED/MADV_FREE on guard pages? Ignored or
>>>> error? It sounds like a usage "error" to me (in contrast to munmap()).
>>>
>>> It's ignored, no error. On MADV_DONTNEED we already left the guard pages in
>>> place, from v3 we will do the same for MADV_FREE.
>>>
>>> I'm not sure I'd say it's an error per se, as somebody might have a use case
>>> where they want to zap over a range but keep guard pages, perhaps an allocator
>>> or something?
>>
>> Hm, not sure I see use for that.
>>
>> Staring at madvise_walk_vmas(), we return ENOMEM on VMA holes, but would
>> process PROT_NONE. So current behavior is at least consistent with PROT_NONE
>> handling (where something could be mapped, though).
> 
> Err, the handling of holes is terrible, yes we return ENOMEM, but we _carry out
> the whole procedure_ then return an error, an error _indistinguishable from an
> error arising from any of the individual parts_.
> 
> Which is just, awful.

Yes, absolutely. I don't know why we decided to continue. And why we 
return ENOMEM ...

> 
>>
>> No strong opinion.
> 
> Well you used up your strong opinion on the naming ;)

Heh, and I am out of energy for this year ;)

In retrospect, "install or remove a guard PTE" is just much better
than anything else ...

So I should never have been misled into suggesting poison/unpoison as a
replacement for poison/remedy :P

> 
>>
>>>
>>> Also the existing logic is that existing markers (HW poison, uffd-simulated HW
>>> poison, uffd wp marker) are retained and no error raised on MADV_DONTNEED, and
>>> no error on MADV_FREE either, so it'd be consistent with existing behaviour.
>>
>>
>> HW poison / uffd-simulated HW poison are expected to be zapped: it's just
>> like a mapped page with HWPOISON. So that is correct.
> 
> Well, poison is _not_ zapped on MADV_DONTNEED but _is_ on MADV_FREE :) anyway, I

Huh?

madvise_dontneed_single_vma()->zap_page_range_single(details=NULL)->unmap_single_vma(details=NULL) 
... zap_pte_range()

} else if (is_hwpoison_entry(entry) ||
	   is_poisoned_swp_entry(entry)) {
	if (!should_zap_cows(details))
		continue;
	...

Should just zap them.

What am I missing?

> mean the MADV flags are a confusing mess generally, as per Vlasta's comments
> which to begin with I strongly disagreed with, then, discussing further, realised
> that no, this is just a bit insane and had driven _me_ insane.
> 
>>
>> UFFD-WP behavior is ... weird. Would not expect MADV_DONTNEED to zap uffd-wp
>> entries.
>>
>>>
>>> Also semantically you are achieving what the calls expect: you are freeing the
>>> ranges, since the guard page regions are unbacked so are already freed... so yeah
>>> I don't think an error really makes sense here.
>>
>> If you compare it to a VMA hole, it makes sense to fail. If we treat it like
>> PROT_NONE, it makes sense to skip them.
>>
>>>
>>> We might also be limiting use cases by assuming they might _only_ be used for
>>> allocators and such.
>>
>> I don't buy that as an argument, sorry :)
>>
>> "Let's map the kernel writable into all user space because otherwise we
>> might be limiting use cases"
> 
> That's a great idea! Patch series incoming, 1st April 2025... :>)

:) Just flip the bit on x86 and we're done!

>>
>>
>> :P
>>
>> --
>> Cheers,
>>
>> David / dhildenb
>>
> 
> Overall I think just always leaving them in place except on remedy - err, sorry,
> sorry - unpoison and munmap, and not returning an error if encountered elsewhere
> (other than, of course, GUP), is the right way forward and most in line with user
> expectation and practical usage.


Fine with me, make sure to document that it behaves like a PROT_NONE 
VMA, not like a memory hole, except when something would trigger a fault 
(GUP etc.).
Lorenzo Stoakes Oct. 21, 2024, 5:14 p.m. UTC | #12
On Mon, Oct 21, 2024 at 07:00:53PM +0200, David Hildenbrand wrote:
> On 21.10.24 18:51, Lorenzo Stoakes wrote:
> > On Mon, Oct 21, 2024 at 06:44:04PM +0200, David Hildenbrand wrote:
> > > On 21.10.24 18:23, Lorenzo Stoakes wrote:
> > > > On Mon, Oct 21, 2024 at 06:00:20PM +0200, David Hildenbrand wrote:
> > > > [snip]
> > > > > >
> > > > > > To summarise for on-list:
> > > > > >
> > > > > > * MADV_FREE, while ostensibly being a 'lazy free' mechanism, has the
> > > > > >      ability to be 'cancelled' if you write to the memory. Also, after the
> > > > > >      freeing is complete, you can write to the memory to reuse it, the mapping
> > > > > >      is still there.
> > > > > >
> > > > > > * For hardware poison markers it makes sense to drop them as you're
> > > > > >      effectively saying 'I am done with this range that is now unbacked and
> > > > > >      expect to get an empty page should I use it now'. UFFD WP I am not sure
> > > > > >      about but presumably also fine.
> > > > > >
> > > > > > * However, guard pages are different - if you 'cancel' and you are left
> > > > > >      with a block of memory allocated to you by a pthread or userland
> > > > > >      allocator implementation, you don't want to thereby lose protection
> > > > > >      against overrunning into other threads' memory.
> > > > >
> > > > > Agreed. What happens on MADV_DONTNEED/MADV_FREE on guard pages? Ignored or
> > > > > error? It sounds like a usage "error" to me (in contrast to munmap()).
> > > >
> > > > It's ignored, no error. On MADV_DONTNEED we already left the guard pages in
> > > > place; from v3 we will do the same for MADV_FREE.
> > > >
> > > > I'm not sure I'd say it's an error per se, as somebody might have a use case
> > > > where they want to zap over a range but keep guard pages, perhaps an allocator
> > > > or something?
> > >
> > > Hm, not sure I see use for that.
> > >
> > > Staring at madvise_walk_vmas(), we return ENOMEM on VMA holes, but would
> > > process PROT_NONE. So current behavior is at least consistent with PROT_NONE
> > > handling (where something could be mapped, though).
> >
> > Err, the handling of holes is terrible, yes we return ENOMEM, but we _carry out
> > the whole procedure_ then return an error, an error _indistinguishable from an
> > error arising from any of the individual parts_.
> >
> > Which is just, awful.
>
> Yes, absolutely. I don't know why we decided to continue. And why we return
> ENOMEM ...

Anyway, it's UAPI, so no turning back now :)

>
> >
> > >
> > > No strong opinion.
> >
> > Well you used up your strong opinion on the naming ;)
>
> Heh, and I am out of energy for this year ;)

Haha understandable...

>
> In retrospect, "install or remove a guard PTE" is just much better than
> anything else ...
>
> So I should never have been misled to suggest poison/unpoison as a
> replacement for poison/remedy :P

You know the reason ;)

>
> >
> > >
> > > >
> > > > Also the existing logic is that existing markers (HW poison, uffd-simulated HW
> > > > poison, uffd wp marker) are retained and no error raised on MADV_DONTNEED, and
> > > > no error on MADV_FREE either, so it'd be consistent with existing behaviour.
> > >
> > >
> > > HW poison / uffd-simulated HW poison are expected to be zapped: it's just
> > > like a mapped page with HWPOISON. So that is correct.
> >
> > Well, poison is _not_ zapped on MADV_DONTNEED but _is_ on MADV_FREE :) anyway, I
>
> Huh?
>
> madvise_dontneed_single_vma()->zap_page_range_single(details=NULL)->unmap_single_vma(details=NULL)
> ... zap_pte_range()
>
> } else if (is_hwpoison_entry(entry) ||
> 	   is_poisoned_swp_entry(entry)) {
> 	if (!should_zap_cows(details))
> 		continue;
> 	...
>
> Should just zap them.
>
> What am I missing?

Yeah ok it's me who's missing something here, I hadn't noticed details == NULL
so should_zap_cows() is true, my mistake!

In any case we explicitly add code here to prevent guard pages from going away. I
will correct everything where I wrongly say otherwise, doh!
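
To spell out the intended semantics, a quick sketch - noting that
MADV_GUARD_POISON is in no released header yet, so the value below is
assumed from this series and may change:

#include <signal.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_GUARD_POISON
#define MADV_GUARD_POISON 102	/* assumed value, per this series */
#endif

static void handler(int sig)
{
	_exit(42);	/* the guard still fired - expected */
}

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	char *p = mmap(NULL, 3 * psz, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* Guard the middle page, then zap the entire range. */
	madvise(p + psz, psz, MADV_GUARD_POISON);
	madvise(p, 3 * psz, MADV_DONTNEED);

	/* The zap leaves the guard marker in place, so this access
	 * must still raise a fatal signal. */
	signal(SIGSEGV, handler);
	p[psz] = 'x';
	return 0;	/* not reached if the guard survived */
}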

>
> > mean the MADV flags are a confusing mess generally, as per Vlasta's comments,
> > which to begin with I strongly disagreed with; then, discussing further, I
> > realised that no, this is just a bit insane and had driven _me_ insane.
> >
> > >
> > > UFFD-WP behavior is ... weird. Would not expect MADV_DONTNEED to zap uffd-wp
> > > entries.
> > >
> > > >
> > > > Also, semantically you are achieving what the calls expect: you are freeing the
> > > > ranges, since the guard page regions are unbacked and so already freed... so yeah,
> > > > I don't think an error really makes sense here.
> > >
> > > If you compare it to a VMA hole, it makes sense to fail. If we treat it like
> > > PROT_NONE, it makes sense to skip them.
> > >
> > > >
> > > > We might also be limiting use cases by assuming they might _only_ be used for
> > > > allocators and such.
> > >
> > > I don't buy that as an argument, sorry :)
> > >
> > > "Let's map the kernel writable into all user space because otherwise we
> > > might be limiting use cases"
> >
> > That's a great idea! Patch series incoming, 1st April 2025... :>)
>
> :) Just flip the bit on x86 and we're done!

;)

>
> > >
> > >
> > > :P
> > >
> > > --
> > > Cheers,
> > >
> > > David / dhildenb
> > >
> >
> > Overall I think just always leaving them in place except on remedy - err, sorry,
> > sorry - unpoison and munmap, and not returning an error if encountered elsewhere
> > (other than, of course, GUP), is the right way forward and most in line with user
> > expectation and practical usage.
>
>
> Fine with me, make sure to document that it behaves like a PROT_NONE VMA,
> not like a memory hole, except when something would trigger a fault (GUP
> etc.).

Ack will make sure to document.
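
Roughly the contrast to document - again a sketch, reusing the assumed
MADV_GUARD_POISON value (not in any released header, per this series):

#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_GUARD_POISON
#define MADV_GUARD_POISON 102	/* assumed value, per this series */
#endif

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	char *p = mmap(NULL, 2 * psz, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* Guard the first page; turn the second into a real hole. */
	madvise(p, psz, MADV_GUARD_POISON);
	munmap(p + psz, psz);

	/* Like PROT_NONE: advice over the guarded page succeeds and
	 * simply leaves the guard entry in place. */
	int over_guard = madvise(p, psz, MADV_DONTNEED);	/* 0 */

	/* Unlike a hole, which yields ENOMEM. */
	int over_hole = madvise(p + psz, psz, MADV_DONTNEED);	/* -1 */

	return !(over_guard == 0 && over_hole == -1);
}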

>
>
> --
> Cheers,
>
> David / dhildenb
>
David Hildenbrand Oct. 22, 2024, 7:13 p.m. UTC | #13
On 21.10.24 19:26, Vlastimil Babka wrote:
> On 10/21/24 19:14, Lorenzo Stoakes wrote:
>> On Mon, Oct 21, 2024 at 07:00:53PM +0200, David Hildenbrand wrote:
>>>
>>>>
>>>>>
>>>>>>
>>>>>> Also the existing logic is that existing markers (HW poison, uffd-simulated HW
>>>>>> poison, uffd wp marker) are retained and no error raised on MADV_DONTNEED, and
>>>>>> no error on MADV_FREE either, so it'd be consistent with existing behaviour.
>>>>>
>>>>>
>>>>> HW poison / uffd-simulated HW poison are expected to be zapped: it's just
>>>>> like a mapped page with HWPOISON. So that is correct.
>>>>
>>>> Well, poison is _not_ zapped on MADV_DONTNEED but _is_ on MADV_FREE :) anyway, I
>>>
>>> Huh?
>>>
>>> madvise_dontneed_single_vma()->zap_page_range_single(details=NULL)->unmap_single_vma(details=NULL)
>>> ... zap_pte_range()
>>>
>>> } else if (is_hwpoison_entry(entry) ||
>>> 	   is_poisoned_swp_entry(entry)) {
>>> 	if (!should_zap_cows(details))
>>> 		continue;
>>> 	...
>>>
>>> Should just zap them.
>>>
>>> What am I missing?
>>
>> Yeah ok it's me who's missing something here, I hadn't noticed details == NULL
>> so should_zap_cows() is true, my mistake!
> 
> Well, good to know it's consistent then. As I've explained, I see why zapping
> actual hwpoison makes sense for MADV_DONTNEED/MADV_FREE. That it's done also
> for uffd poison is not completely clear, but maybe it was just easier to
> implement. 

Note that in VM context "uffd poison" really just is "this was hwpoison 
on the source VM, so we mimic that on the destination VM, because the 
data *is* lost" -- so you want the exact same behavior.

For example, when a VM reboots you might just want to ZAP these hwpoison 
entries, and get fresh pages on next access.

So to me it makes sense that they are treated equally.
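
A rough sketch of that flow from userland (error handling elided;
UFFDIO_POISON needs a recent kernel - 6.6+ if I recall - and matching
headers, and userfaultfd() may need privilege or
vm.unprivileged_userfaultfd=1):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

	struct uffdio_api api = { .api = UFFD_API };
	ioctl(uffd, UFFDIO_API, &api);

	char *p = mmap(NULL, psz, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	struct uffdio_register reg = {
		.range = { .start = (unsigned long)p, .len = psz },
		.mode = UFFDIO_REGISTER_MODE_MISSING,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	/* "This was hwpoison on the source VM": mark the data lost.
	 * Touching the page now would SIGBUS. */
	struct uffdio_poison poison = {
		.range = { .start = (unsigned long)p, .len = psz },
	};
	ioctl(uffd, UFFDIO_POISON, &poison);

	/* "The VM reboots": the zap drops the poison marker... */
	madvise(p, psz, MADV_DONTNEED);

	struct uffdio_range rng = { .start = (unsigned long)p, .len = psz };
	ioctl(uffd, UFFDIO_UNREGISTER, &rng);

	/* ...and the next access gets a fresh zero page, not SIGBUS. */
	return p[0];
}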