Message ID | cover.1729440856.git.lorenzo.stoakes@oracle.com |
---|---|
Headers | show |
Series | implement lightweight guard pages | expand |
On Sun, Oct 20, 2024 at 07:37:54PM +0200, Florian Weimer wrote: > * Lorenzo Stoakes: > > > Early testing of the prototype version of this code suggests a 5 times > > speed up in memory mapping invocations (in conjunction with use of > > process_madvise()) and a 13% reduction in VMAs on an entirely idle android > > system and unoptimised code. > > > > We expect with optimisation and a loaded system with a larger number of > > guard pages this could significantly increase, but in any case these > > numbers are encouraging. > > > > This way, rather than having separate VMAs specifying which parts of a > > range are guard pages, instead we have a VMA spanning the entire range of > > memory a user is permitted to access and including ranges which are to be > > 'guarded'. > > > > After mapping this, a user can specify which parts of the range should > > result in a fatal signal when accessed. > > > > By restricting the ability to specify guard pages to memory mapped by > > existing VMAs, we can rely on the mappings being torn down when the > > mappings are ultimately unmapped and everything works simply as if the > > memory were not faulted in, from the point of view of the containing VMAs. > > We have a glibc (so not Android) dynamic linker bug that asks us to > remove PROT_NONE mappings in mapped shared objects: > > Extra struct vm_area_struct with ---p created when PAGE_SIZE < max-page-size > <https://sourceware.org/bugzilla/show_bug.cgi?id=31076> > > It's slightly different from a guard page because our main goal is to > avoid other mappings to end up in those gaps, which has been shown to > cause odd application behavior in cases where it happens. If I > understand the series correctly, the kernel would not automatically > attribute those PROT_NONE gaps to the previous or subsequent mapping. > We would have to extend one of the surrounding mapps and apply > MADV_POISON to that over-mapped part. That doesn't seem too onerous. > > Could the ELF loader in the kernel do the same thing for the main > executable and the program loader? Currently this implementation is only available for private anonymous memory. We may look at extending it to shmem and file-backed memory in the future but there are a whole host of things to overcome to make that work so it's one step at a time! :)
On Mon, Oct 21, 2024 at 03:27:55PM +0200, Vlastimil Babka wrote: > On 10/20/24 18:20, Lorenzo Stoakes wrote: > > The existing generic pagewalk logic permits the walking of page tables, > > invoking callbacks at individual page table levels via user-provided > > mm_walk_ops callbacks. > > > > This is useful for traversing existing page table entries, but precludes > > the ability to establish new ones. > > > > Existing mechanism for performing a walk which also installs page table > > entries if necessary are heavily duplicated throughout the kernel, each > > with semantic differences from one another and largely unavailable for use > > elsewhere. > > > > Rather than add yet another implementation, we extend the generic pagewalk > > logic to enable the installation of page table entries by adding a new > > install_pte() callback in mm_walk_ops. If this is specified, then upon > > encountering a missing page table entry, we allocate and install a new one > > and continue the traversal. > > > > If a THP huge page is encountered, we make use of existing logic to split > > it. Then once we reach the PTE level, we invoke the install_pte() callback > > which provides a PTE entry to install. We do not support hugetlb at this > > stage. > > > > If this function returns an error, or an allocation fails during the > > operation, we abort the operation altogether. It is up to the caller to > > deal appropriately with partially populated page table ranges. > > > > If install_pte() is defined, the semantics of pte_entry() change - this > > callback is then only invoked if the entry already exists. This is a useful > > property, as it allows a caller to handle existing PTEs while installing > > new ones where necessary in the specified range. > > > > If install_pte() is not defined, then there is no functional difference to > > this patch, so all existing logic will work precisely as it did before. > > > > As we only permit the installation of PTEs where a mapping does not already > > exist there is no need for TLB management, however we do invoke > > update_mmu_cache() for architectures which require manual maintenance of > > mappings for other CPUs. > > > > We explicitly do not allow the existing page walk API to expose this > > feature as it is dangerous and intended for internal mm use only. Therefore > > we provide a new walk_page_range_mm() function exposed only to > > mm/internal.h. > > > > Reviewed-by: Jann Horn <jannh@google.com> > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > <snip> > > > /* > > * We want to know the real level where a entry is located ignoring any > > * folding of levels which may be happening. For example if p4d is folded then > > @@ -29,9 +34,23 @@ static int walk_pte_range_inner(pte_t *pte, unsigned long addr, > > int err = 0; > > > > for (;;) { > > - err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk); > > - if (err) > > - break; > > + if (ops->install_pte && pte_none(ptep_get(pte))) { > > + pte_t new_pte; > > + > > + err = ops->install_pte(addr, addr + PAGE_SIZE, &new_pte, > > + walk); > > + if (err) > > + break; > > + > > + set_pte_at(walk->mm, addr, pte, new_pte); > > While the guard pages install ptes unconditionally, maybe some install_pte > handler implementation would sometimes want to skip, should ve define an > error code that means its skipped and just continue instead of set_pte_at()? > Or leave it until such handler appears. I'm not sure under what circumstances you'd want to skip though precisely? There's nothing populated, and the user is defining the range in which to install a PTE if nothing is populated. If they wanted more complicated handling they could do multiple, smaller calls. Things are inherently racey with these walks so there's no benefit in doing everything at once. > > > + /* Non-present before, so for arches that need it. */ > > + if (!WARN_ON_ONCE(walk->no_vma)) > > + update_mmu_cache(walk->vma, addr, pte); > > + } else { > > + err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk); > > + if (err) > > + break; > > + } > > if (addr >= end - PAGE_SIZE) > > break; > > addr += PAGE_SIZE; > > @@ -89,11 +108,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end, > > again: > > next = pmd_addr_end(addr, end); > > if (pmd_none(*pmd)) { > > - if (ops->pte_hole) > > + if (ops->install_pte) > > + err = __pte_alloc(walk->mm, pmd); > > + else if (ops->pte_hole) > > err = ops->pte_hole(addr, next, depth, walk); > > if (err) > > break; > > - continue; > > + if (!ops->install_pte) > > + continue; > > } > > > > walk->action = ACTION_SUBTREE; > > @@ -116,7 +138,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end, > > */ > > if ((!walk->vma && (pmd_leaf(*pmd) || !pmd_present(*pmd))) || > > walk->action == ACTION_CONTINUE || > > - !(ops->pte_entry)) > > + !(ops->pte_entry || ops->install_pte)) > > continue; > > BTW, I find it hard to read this condition even before your patch, oh well. Agreed, this badly needs refactoring, but felt out of scope for this change. > But if I read it correctly, does it mean we're going to split a pmd-mapped > THP if we have a install_pte handler? But is that really necessary if the > pmd splitting results in all ptes populated, and thus the install_pte > handler can't do anything with any pte anyway? Yes... however nothing else here in the logic has special handling for transhuge pages AND there is already an interface provided to prevent this if you want, which we use in commit 3/5, that is to provide pud, pmd walkers that set walk->action = ACTION_CONTINUE if transhuge. Having said that, it kind of sucks that we are doing a useless split here. Hmm. In the pte_entry() case you might very well want to split and do something with the PTE. With the install you are only interested if it's non-present... It's not _completely_ infeasible that a user would want this, but it's very unlikely. OK so yeah let's add it and clean up this expression while we're at it, will fix on respin. > > > if (walk->vma) > > @@ -148,11 +170,14 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end, > > again: > > next = pud_addr_end(addr, end); > > if (pud_none(*pud)) { > > - if (ops->pte_hole) > > + if (ops->install_pte) > > + err = __pmd_alloc(walk->mm, pud, addr); > > + else if (ops->pte_hole) > > err = ops->pte_hole(addr, next, depth, walk); > > if (err) > > break; > > - continue; > > + if (!ops->install_pte) > > + continue; > > } > > > > walk->action = ACTION_SUBTREE; > > @@ -167,7 +192,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end, > > > > if ((!walk->vma && (pud_leaf(*pud) || !pud_present(*pud))) || > > walk->action == ACTION_CONTINUE || > > - !(ops->pmd_entry || ops->pte_entry)) > > + !(ops->pmd_entry || ops->pte_entry || ops->install_pte)) > > continue; > > Ditto? >
On Mon, Oct 21, 2024 at 04:13:34PM +0200, Vlastimil Babka wrote: > On 10/20/24 18:20, Lorenzo Stoakes wrote: > > Add a new PTE marker that results in any access causing the accessing > > process to segfault. > > > > This is preferable to PTE_MARKER_POISONED, which results in the same > > handling as hardware poisoned memory, and is thus undesirable for cases > > where we simply wish to 'soft' poison a range. > > > > This is in preparation for implementing the ability to specify guard pages > > at the page table level, i.e. ranges that, when accessed, should cause > > process termination. > > > > Additionally, rename zap_drop_file_uffd_wp() to zap_drop_markers() - the > > function checks the ZAP_FLAG_DROP_MARKER flag so naming it for this single > > purpose was simply incorrect. > > > > We then reuse the same logic to determine whether a zap should clear a > > guard entry - this should only be performed on teardown and never on > > MADV_DONTNEED or the like. > > Since I would have personally put MADV_FREE among "or the like" here, it's > surprising to me that it in fact it's tearing down the guard entries now. Is > that intentional? It should be at least mentioned very explicitly. But I'd > really argue against it, as MADV_FREE is to me a weaker form of > MADV_DONTNEED - the existing pages are not zapped immediately but > prioritized for reclaim. If MADV_DONTNEED leaves guard PTEs in place, why > shouldn't MADV_FREE too? That is not, as I understand it, what MADV_FREE is, semantically. From the man pages: MADV_FREE (since Linux 4.5) The application no longer requires the pages in the range specified by addr and len. The kernel can thus free these pages, but the freeing could be delayed until memory pressure occurs. MADV_DONTNEED Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources associated with it.) MADV_FREE is 'we are completely done with this range'. MADV_DONTNEED is 'we don't expect to use it in the near future'. > > Seems to me rather currently an artifact of MADV_FREE implementation - if it > encounters hwpoison entries it will tear them down because why not, we have > detected a hw memory error and are lucky the program wants to discard the > pages and not access them, so best use the opportunity and get rid of the > PTE entries immediately (if MADV_DONTNEED doesn't do that too, it certainly > could). Right, but we explicitly do not tear them down in the case of MADV_DONTNEED which matches the description in the manpages that the user _might_ come back to the range, whereas MADV_FREE means they are truly done but just don't want the overhead of actually unmapping at this point. Seems to be this is moreso that MADV_FREE is a not-really-as-efficient version of what Rik wants to do with his MADV_LAZYFREE thing. > > But to extend this to guard PTEs which are result of an explicit userspace > action feels wrong, unless the semantics is the same for MADV_DONTEED. The > semantics chosen for MADV_DONTNEED makes sense, so MADV_FREE should behave > the same? My understanding from the above is that MADV_FREE is a softer version of munmap(), i.e. 'totally done with this range', whereas MADV_DONTNEED is a 'revert state to when I first mapped this stuff because I'm done with it for now but might use it later'. Obviously happy to be corrected if my understanding of this is off-the-mark but this is what motivated me to do it this way! > > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > --- > > include/linux/mm_inline.h | 2 +- > > include/linux/swapops.h | 26 ++++++++++++++++++++++++-- > > mm/hugetlb.c | 3 +++ > > mm/memory.c | 18 +++++++++++++++--- > > 4 files changed, 43 insertions(+), 6 deletions(-) > > > > diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h > > index 355cf46a01a6..1b6a917fffa4 100644 > > --- a/include/linux/mm_inline.h > > +++ b/include/linux/mm_inline.h > > @@ -544,7 +544,7 @@ static inline pte_marker copy_pte_marker( > > { > > pte_marker srcm = pte_marker_get(entry); > > /* Always copy error entries. */ > > - pte_marker dstm = srcm & PTE_MARKER_POISONED; > > + pte_marker dstm = srcm & (PTE_MARKER_POISONED | PTE_MARKER_GUARD); > > > > /* Only copy PTE markers if UFFD register matches. */ > > if ((srcm & PTE_MARKER_UFFD_WP) && userfaultfd_wp(dst_vma)) > > diff --git a/include/linux/swapops.h b/include/linux/swapops.h > > index cb468e418ea1..4d0606df0791 100644 > > --- a/include/linux/swapops.h > > +++ b/include/linux/swapops.h > > @@ -426,9 +426,15 @@ typedef unsigned long pte_marker; > > * "Poisoned" here is meant in the very general sense of "future accesses are > > * invalid", instead of referring very specifically to hardware memory errors. > > * This marker is meant to represent any of various different causes of this. > > + * > > + * Note that, when encountered by the faulting logic, PTEs with this marker will > > + * result in VM_FAULT_HWPOISON and thus regardless trigger hardware memory error > > + * logic. > > */ > > #define PTE_MARKER_POISONED BIT(1) > > -#define PTE_MARKER_MASK (BIT(2) - 1) > > +/* Indicates that, on fault, this PTE will case a SIGSEGV signal to be sent. */ > > +#define PTE_MARKER_GUARD BIT(2) > > +#define PTE_MARKER_MASK (BIT(3) - 1) > > > > static inline swp_entry_t make_pte_marker_entry(pte_marker marker) > > { > > @@ -461,9 +467,25 @@ static inline swp_entry_t make_poisoned_swp_entry(void) > > } > > > > static inline int is_poisoned_swp_entry(swp_entry_t entry) > > +{ > > + /* > > + * We treat guard pages as poisoned too as these have the same semantics > > + * as poisoned ranges, only with different fault handling. > > + */ > > + return is_pte_marker_entry(entry) && > > + (pte_marker_get(entry) & > > + (PTE_MARKER_POISONED | PTE_MARKER_GUARD)); > > +} > > + > > +static inline swp_entry_t make_guard_swp_entry(void) > > +{ > > + return make_pte_marker_entry(PTE_MARKER_GUARD); > > +} > > + > > +static inline int is_guard_swp_entry(swp_entry_t entry) > > { > > return is_pte_marker_entry(entry) && > > - (pte_marker_get(entry) & PTE_MARKER_POISONED); > > + (pte_marker_get(entry) & PTE_MARKER_GUARD); > > } > > > > /* > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > index 906294ac85dc..50e3f6ed73ac 100644 > > --- a/mm/hugetlb.c > > +++ b/mm/hugetlb.c > > @@ -6353,6 +6353,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, > > ret = VM_FAULT_HWPOISON_LARGE | > > VM_FAULT_SET_HINDEX(hstate_index(h)); > > goto out_mutex; > > + } else if (marker & PTE_MARKER_GUARD) { > > + ret = VM_FAULT_SIGSEGV; > > + goto out_mutex; > > } > > } > > > > diff --git a/mm/memory.c b/mm/memory.c > > index 0f614523b9f4..551455cd453f 100644 > > --- a/mm/memory.c > > +++ b/mm/memory.c > > @@ -1455,7 +1455,7 @@ static inline bool should_zap_folio(struct zap_details *details, > > return !folio_test_anon(folio); > > } > > > > -static inline bool zap_drop_file_uffd_wp(struct zap_details *details) > > +static inline bool zap_drop_markers(struct zap_details *details) > > { > > if (!details) > > return false; > > @@ -1476,7 +1476,7 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma, > > if (vma_is_anonymous(vma)) > > return; > > > > - if (zap_drop_file_uffd_wp(details)) > > + if (zap_drop_markers(details)) > > return; > > > > for (;;) { > > @@ -1671,7 +1671,15 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, > > * drop the marker if explicitly requested. > > */ > > if (!vma_is_anonymous(vma) && > > - !zap_drop_file_uffd_wp(details)) > > + !zap_drop_markers(details)) > > + continue; > > + } else if (is_guard_swp_entry(entry)) { > > + /* > > + * Ordinary zapping should not remove guard PTE > > + * markers. Only do so if we should remove PTE markers > > + * in general. > > + */ > > + if (!zap_drop_markers(details)) > > continue; > > } else if (is_hwpoison_entry(entry) || > > is_poisoned_swp_entry(entry)) { > > @@ -4003,6 +4011,10 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf) > > if (marker & PTE_MARKER_POISONED) > > return VM_FAULT_HWPOISON; > > > > + /* Hitting a guard page is always a fatal condition. */ > > + if (marker & PTE_MARKER_GUARD) > > + return VM_FAULT_SIGSEGV; > > + > > if (pte_marker_entry_uffd_wp(entry)) > > return pte_marker_handle_uffd_wp(vmf); > > >
On 10/21/24 16:33, Lorenzo Stoakes wrote: > On Mon, Oct 21, 2024 at 04:13:34PM +0200, Vlastimil Babka wrote: >> On 10/20/24 18:20, Lorenzo Stoakes wrote: >> > Add a new PTE marker that results in any access causing the accessing >> > process to segfault. >> > >> > This is preferable to PTE_MARKER_POISONED, which results in the same >> > handling as hardware poisoned memory, and is thus undesirable for cases >> > where we simply wish to 'soft' poison a range. >> > >> > This is in preparation for implementing the ability to specify guard pages >> > at the page table level, i.e. ranges that, when accessed, should cause >> > process termination. >> > >> > Additionally, rename zap_drop_file_uffd_wp() to zap_drop_markers() - the >> > function checks the ZAP_FLAG_DROP_MARKER flag so naming it for this single >> > purpose was simply incorrect. >> > >> > We then reuse the same logic to determine whether a zap should clear a >> > guard entry - this should only be performed on teardown and never on >> > MADV_DONTNEED or the like. >> >> Since I would have personally put MADV_FREE among "or the like" here, it's >> surprising to me that it in fact it's tearing down the guard entries now. Is >> that intentional? It should be at least mentioned very explicitly. But I'd >> really argue against it, as MADV_FREE is to me a weaker form of >> MADV_DONTNEED - the existing pages are not zapped immediately but >> prioritized for reclaim. If MADV_DONTNEED leaves guard PTEs in place, why >> shouldn't MADV_FREE too? > > That is not, as I understand it, what MADV_FREE is, semantically. From the > man pages: > > MADV_FREE (since Linux 4.5) > > The application no longer requires the pages in the range > specified by addr and len. The kernel can thus free these > pages, but the freeing could be delayed until memory pressure > occurs. > > MADV_DONTNEED > > Do not expect access in the near future. (For the time > being, the application is finished with the given range, so > the kernel can free resources associated with it.) > > MADV_FREE is 'we are completely done with this range'. MADV_DONTNEED is 'we > don't expect to use it in the near future'. I think the description gives a wrong impression. What I think matters it what happens (limited to anon private case as MADV_FREE doesn't support any other) MADV_DONTNEED - pages discarded immediately, further access gives new zero-filled pages MADV_FREE - pages prioritized for discarding, if that happens before next write, it gets zero-filled page on next access, but a write done soon enough can cancel the upcoming discard. In that sense, MADV_FREE is a weaker form of DONTNEED, no? >> >> Seems to me rather currently an artifact of MADV_FREE implementation - if it >> encounters hwpoison entries it will tear them down because why not, we have >> detected a hw memory error and are lucky the program wants to discard the >> pages and not access them, so best use the opportunity and get rid of the >> PTE entries immediately (if MADV_DONTNEED doesn't do that too, it certainly >> could). > > Right, but we explicitly do not tear them down in the case of MADV_DONTNEED > which matches the description in the manpages that the user _might_ come > back to the range, whereas MADV_FREE means they are truly done but just > don't want the overhead of actually unmapping at this point. But it's also defined what happens if user comes back to the range after a MADV_FREE. I think the overhead saved happens in the case of actually coming back soon enough to prevent the discard. With MADV_DONTNEED its immediate and unconditional. > Seems to be this is moreso that MADV_FREE is a not-really-as-efficient > version of what Rik wants to do with his MADV_LAZYFREE thing. I think that further optimizes MADV_FREE, which is already more optimized than MADV_DONTNEED. >> >> But to extend this to guard PTEs which are result of an explicit userspace >> action feels wrong, unless the semantics is the same for MADV_DONTEED. The >> semantics chosen for MADV_DONTNEED makes sense, so MADV_FREE should behave >> the same? > > My understanding from the above is that MADV_FREE is a softer version of > munmap(), i.e. 'totally done with this range', whereas MADV_DONTNEED is a > 'revert state to when I first mapped this stuff because I'm done with it > for now but might use it later'.
On Mon, Oct 21, 2024 at 04:54:06PM +0200, Vlastimil Babka wrote: > On 10/21/24 16:33, Lorenzo Stoakes wrote: > > On Mon, Oct 21, 2024 at 04:13:34PM +0200, Vlastimil Babka wrote: > >> On 10/20/24 18:20, Lorenzo Stoakes wrote: > >> > Add a new PTE marker that results in any access causing the accessing > >> > process to segfault. > >> > > >> > This is preferable to PTE_MARKER_POISONED, which results in the same > >> > handling as hardware poisoned memory, and is thus undesirable for cases > >> > where we simply wish to 'soft' poison a range. > >> > > >> > This is in preparation for implementing the ability to specify guard pages > >> > at the page table level, i.e. ranges that, when accessed, should cause > >> > process termination. > >> > > >> > Additionally, rename zap_drop_file_uffd_wp() to zap_drop_markers() - the > >> > function checks the ZAP_FLAG_DROP_MARKER flag so naming it for this single > >> > purpose was simply incorrect. > >> > > >> > We then reuse the same logic to determine whether a zap should clear a > >> > guard entry - this should only be performed on teardown and never on > >> > MADV_DONTNEED or the like. > >> > >> Since I would have personally put MADV_FREE among "or the like" here, it's > >> surprising to me that it in fact it's tearing down the guard entries now. Is > >> that intentional? It should be at least mentioned very explicitly. But I'd > >> really argue against it, as MADV_FREE is to me a weaker form of > >> MADV_DONTNEED - the existing pages are not zapped immediately but > >> prioritized for reclaim. If MADV_DONTNEED leaves guard PTEs in place, why > >> shouldn't MADV_FREE too? > > > > That is not, as I understand it, what MADV_FREE is, semantically. From the > > man pages: > > > > MADV_FREE (since Linux 4.5) > > > > The application no longer requires the pages in the range > > specified by addr and len. The kernel can thus free these > > pages, but the freeing could be delayed until memory pressure > > occurs. > > > > MADV_DONTNEED > > > > Do not expect access in the near future. (For the time > > being, the application is finished with the given range, so > > the kernel can free resources associated with it.) > > > > MADV_FREE is 'we are completely done with this range'. MADV_DONTNEED is 'we > > don't expect to use it in the near future'. > > I think the description gives a wrong impression. What I think matters it > what happens (limited to anon private case as MADV_FREE doesn't support any > other) > > MADV_DONTNEED - pages discarded immediately, further access gives new > zero-filled pages > > MADV_FREE - pages prioritized for discarding, if that happens before next > write, it gets zero-filled page on next access, but a write done soon enough > can cancel the upcoming discard. > > In that sense, MADV_FREE is a weaker form of DONTNEED, no? > > >> > >> Seems to me rather currently an artifact of MADV_FREE implementation - if it > >> encounters hwpoison entries it will tear them down because why not, we have > >> detected a hw memory error and are lucky the program wants to discard the > >> pages and not access them, so best use the opportunity and get rid of the > >> PTE entries immediately (if MADV_DONTNEED doesn't do that too, it certainly > >> could). > > > > Right, but we explicitly do not tear them down in the case of MADV_DONTNEED > > which matches the description in the manpages that the user _might_ come > > back to the range, whereas MADV_FREE means they are truly done but just > > don't want the overhead of actually unmapping at this point. > > But it's also defined what happens if user comes back to the range after a > MADV_FREE. I think the overhead saved happens in the case of actually coming > back soon enough to prevent the discard. With MADV_DONTNEED its immediate > and unconditional. > > > Seems to be this is moreso that MADV_FREE is a not-really-as-efficient > > version of what Rik wants to do with his MADV_LAZYFREE thing. > > I think that further optimizes MADV_FREE, which is already more optimized > than MADV_DONTNEED. > > >> > >> But to extend this to guard PTEs which are result of an explicit userspace > >> action feels wrong, unless the semantics is the same for MADV_DONTEED. The > >> semantics chosen for MADV_DONTNEED makes sense, so MADV_FREE should behave > >> the same? > > > > My understanding from the above is that MADV_FREE is a softer version of > > munmap(), i.e. 'totally done with this range', whereas MADV_DONTNEED is a > > 'revert state to when I first mapped this stuff because I'm done with it > > for now but might use it later'. > > From the implementation I get the opposite understanding. Neither tears down > the vma like a proper unmap(). MADV_DONTNEED zaps page tables immediately, > MADV_FREE effectively too but with a delay depending on memory pressure. > OK so based on IRC chat I think the conclusion here is TL;DR yes we have to change this, you're right :) To summarise for on-list: * MADV_FREE, while ostensibly being a 'lazy free' mechanism, has the ability to be 'cancelled' if you write to the memory. Also, after the freeing is complete, you can write to the memory to reuse it, the mapping is still there. * For hardware poison markers it makes sense to drop them as you're effectively saying 'I am done with this range that is now unbacked and expect to get an empty page should I use it now'. UFFD WP I am not sure about but presumably also fine. * However, guard pages are different - if you 'cancel' and you are left with a block of memory allocated to you by a pthread or userland allocator implementation, you don't want to then no longer be protected from overrunning into other thread memory. So, while I implemented this based on a. the semantics as apparently expressed in the man page and b. existing PTE marker behaviour, it seems that it would actually be problematic to do so. So yeah, I'll change that in v3!
On Mon, Oct 21, 2024 at 04:33:19PM +0100, Lorenzo Stoakes wrote: [snip] > * For hardware poison markers it makes sense to drop them as you're > effectively saying 'I am done with this range that is now unbacked and > expect to get an empty page should I use it now'. UFFD WP I am not sure > about but presumably also fine. [snip] CORRECTION: (sorry multitasking atm now I have a non-melted CPU) UFFD WP is safe from MADV_FREE as it checks is_poisoned_swp_entry() which this patch updates to include guard pages, a change I will undo in v3. So disregard comment on UFFD WP here.
On 21.10.24 17:33, Lorenzo Stoakes wrote: > On Mon, Oct 21, 2024 at 04:54:06PM +0200, Vlastimil Babka wrote: >> On 10/21/24 16:33, Lorenzo Stoakes wrote: >>> On Mon, Oct 21, 2024 at 04:13:34PM +0200, Vlastimil Babka wrote: >>>> On 10/20/24 18:20, Lorenzo Stoakes wrote: >>>>> Add a new PTE marker that results in any access causing the accessing >>>>> process to segfault. >>>>> >>>>> This is preferable to PTE_MARKER_POISONED, which results in the same >>>>> handling as hardware poisoned memory, and is thus undesirable for cases >>>>> where we simply wish to 'soft' poison a range. >>>>> >>>>> This is in preparation for implementing the ability to specify guard pages >>>>> at the page table level, i.e. ranges that, when accessed, should cause >>>>> process termination. >>>>> >>>>> Additionally, rename zap_drop_file_uffd_wp() to zap_drop_markers() - the >>>>> function checks the ZAP_FLAG_DROP_MARKER flag so naming it for this single >>>>> purpose was simply incorrect. >>>>> >>>>> We then reuse the same logic to determine whether a zap should clear a >>>>> guard entry - this should only be performed on teardown and never on >>>>> MADV_DONTNEED or the like. >>>> >>>> Since I would have personally put MADV_FREE among "or the like" here, it's >>>> surprising to me that it in fact it's tearing down the guard entries now. Is >>>> that intentional? It should be at least mentioned very explicitly. But I'd >>>> really argue against it, as MADV_FREE is to me a weaker form of >>>> MADV_DONTNEED - the existing pages are not zapped immediately but >>>> prioritized for reclaim. If MADV_DONTNEED leaves guard PTEs in place, why >>>> shouldn't MADV_FREE too? >>> >>> That is not, as I understand it, what MADV_FREE is, semantically. From the >>> man pages: >>> >>> MADV_FREE (since Linux 4.5) >>> >>> The application no longer requires the pages in the range >>> specified by addr and len. The kernel can thus free these >>> pages, but the freeing could be delayed until memory pressure >>> occurs. >>> >>> MADV_DONTNEED >>> >>> Do not expect access in the near future. (For the time >>> being, the application is finished with the given range, so >>> the kernel can free resources associated with it.) >>> >>> MADV_FREE is 'we are completely done with this range'. MADV_DONTNEED is 'we >>> don't expect to use it in the near future'. >> >> I think the description gives a wrong impression. What I think matters it >> what happens (limited to anon private case as MADV_FREE doesn't support any >> other) >> >> MADV_DONTNEED - pages discarded immediately, further access gives new >> zero-filled pages >> >> MADV_FREE - pages prioritized for discarding, if that happens before next >> write, it gets zero-filled page on next access, but a write done soon enough >> can cancel the upcoming discard. >> >> In that sense, MADV_FREE is a weaker form of DONTNEED, no? >> >>>> >>>> Seems to me rather currently an artifact of MADV_FREE implementation - if it >>>> encounters hwpoison entries it will tear them down because why not, we have >>>> detected a hw memory error and are lucky the program wants to discard the >>>> pages and not access them, so best use the opportunity and get rid of the >>>> PTE entries immediately (if MADV_DONTNEED doesn't do that too, it certainly >>>> could). >>> >>> Right, but we explicitly do not tear them down in the case of MADV_DONTNEED >>> which matches the description in the manpages that the user _might_ come >>> back to the range, whereas MADV_FREE means they are truly done but just >>> don't want the overhead of actually unmapping at this point. >> >> But it's also defined what happens if user comes back to the range after a >> MADV_FREE. I think the overhead saved happens in the case of actually coming >> back soon enough to prevent the discard. With MADV_DONTNEED its immediate >> and unconditional. >> >>> Seems to be this is moreso that MADV_FREE is a not-really-as-efficient >>> version of what Rik wants to do with his MADV_LAZYFREE thing. >> >> I think that further optimizes MADV_FREE, which is already more optimized >> than MADV_DONTNEED. >> >>>> >>>> But to extend this to guard PTEs which are result of an explicit userspace >>>> action feels wrong, unless the semantics is the same for MADV_DONTEED. The >>>> semantics chosen for MADV_DONTNEED makes sense, so MADV_FREE should behave >>>> the same? >>> >>> My understanding from the above is that MADV_FREE is a softer version of >>> munmap(), i.e. 'totally done with this range', whereas MADV_DONTNEED is a >>> 'revert state to when I first mapped this stuff because I'm done with it >>> for now but might use it later'. >> >> From the implementation I get the opposite understanding. Neither tears down >> the vma like a proper unmap(). MADV_DONTNEED zaps page tables immediately, >> MADV_FREE effectively too but with a delay depending on memory pressure. >> > > OK so based on IRC chat I think the conclusion here is TL;DR yes we have to > change this, you're right :) > > To summarise for on-list: > > * MADV_FREE, while ostensibly being a 'lazy free' mechanism, has the > ability to be 'cancelled' if you write to the memory. Also, after the > freeing is complete, you can write to the memory to reuse it, the mapping > is still there. > > * For hardware poison markers it makes sense to drop them as you're > effectively saying 'I am done with this range that is now unbacked and > expect to get an empty page should I use it now'. UFFD WP I am not sure > about but presumably also fine. > > * However, guard pages are different - if you 'cancel' and you are left > with a block of memory allocated to you by a pthread or userland > allocator implementation, you don't want to then no longer be protected > from overrunning into other thread memory. Agreed. What happens on MADV_DONTNEED/MADV_FREE on guard pages? Ignored or error? It sounds like a usage "error" to me (in contrast to munmap()).
On Mon, Oct 21, 2024 at 06:00:20PM +0200, David Hildenbrand wrote: [snip] > > > > To summarise for on-list: > > > > * MADV_FREE, while ostensibly being a 'lazy free' mechanism, has the > > ability to be 'cancelled' if you write to the memory. Also, after the > > freeing is complete, you can write to the memory to reuse it, the mapping > > is still there. > > > > * For hardware poison markers it makes sense to drop them as you're > > effectively saying 'I am done with this range that is now unbacked and > > expect to get an empty page should I use it now'. UFFD WP I am not sure > > about but presumably also fine. > > > > * However, guard pages are different - if you 'cancel' and you are left > > with a block of memory allocated to you by a pthread or userland > > allocator implementation, you don't want to then no longer be protected > > from overrunning into other thread memory. > > Agreed. What happens on MADV_DONTNEED/MADV_FREE on guard pages? Ignored or > error? It sounds like a usage "error" to me (in contrast to munmap()). It's ignored, no errror. On MADV_DONTNEED we already left the guard pages in place, from v3 we will do the same for MADV_FREE. I'm not sure I'd say it's an error per se, as somebody might have a use case where they want to zap over a range but keep guard pages, perhaps an allocator or something? Also the existing logic is that existing markers (HW poison, uffd-simulated HW poison, uffd wp marker) are retained and no error raised on MADV_DONTNEED, and no error on MADV_FREE either, so it'd be consistent with existing behaviour. Also semantically you are achieving what the calls expect you are freeing the ranges since the guard page regions are unbacked so are already freed... so yeah I don't think an error really makes sense here. We might also be limiting use cases by assuming they might _only_ be used for allocators and such. > > -- > Cheers, > > David / dhildenb >
On 21.10.24 18:23, Lorenzo Stoakes wrote: > On Mon, Oct 21, 2024 at 06:00:20PM +0200, David Hildenbrand wrote: > [snip] >>> >>> To summarise for on-list: >>> >>> * MADV_FREE, while ostensibly being a 'lazy free' mechanism, has the >>> ability to be 'cancelled' if you write to the memory. Also, after the >>> freeing is complete, you can write to the memory to reuse it, the mapping >>> is still there. >>> >>> * For hardware poison markers it makes sense to drop them as you're >>> effectively saying 'I am done with this range that is now unbacked and >>> expect to get an empty page should I use it now'. UFFD WP I am not sure >>> about but presumably also fine. >>> >>> * However, guard pages are different - if you 'cancel' and you are left >>> with a block of memory allocated to you by a pthread or userland >>> allocator implementation, you don't want to then no longer be protected >>> from overrunning into other thread memory. >> >> Agreed. What happens on MADV_DONTNEED/MADV_FREE on guard pages? Ignored or >> error? It sounds like a usage "error" to me (in contrast to munmap()). > > It's ignored, no errror. On MADV_DONTNEED we already left the guard pages in > place, from v3 we will do the same for MADV_FREE. > > I'm not sure I'd say it's an error per se, as somebody might have a use case > where they want to zap over a range but keep guard pages, perhaps an allocator > or something? Hm, not sure I see use for that. Staring at madvise_walk_vmas(), we return ENOMEM on VMA holes, but would process PROT_NONE. So current behavior is at least consistent with PROT_NONE handling (where something could be mapped, though). No strong opinion. > > Also the existing logic is that existing markers (HW poison, uffd-simulated HW > poison, uffd wp marker) are retained and no error raised on MADV_DONTNEED, and > no error on MADV_FREE either, so it'd be consistent with existing behaviour. HW poison / uffd-simulated HW poison are expected to be zapped: it's just like a mapped page with HWPOISON. So that is correct. UFFD-WP behavior is ... weird. Would not expect MADV_DONTNEED to zap uffd-wp entries. > > Also semantically you are achieving what the calls expect you are freeing the > ranges since the guard page regions are unbacked so are already freed... so yeah > I don't think an error really makes sense here. I you compare it to a VMA hole, it make sense to fail. If we treat it like PROT_NONE, it make sense to skip them. > > We might also be limiting use cases by assuming they might _only_ be used for > allocators and such. I don't buy that as an argument, sorry :) "Let's map the kernel writable into all user space because otherwise we might be limiting use cases" :P
On Mon, Oct 21, 2024 at 06:44:04PM +0200, David Hildenbrand wrote: > On 21.10.24 18:23, Lorenzo Stoakes wrote: > > On Mon, Oct 21, 2024 at 06:00:20PM +0200, David Hildenbrand wrote: > > [snip] > > > > > > > > To summarise for on-list: > > > > > > > > * MADV_FREE, while ostensibly being a 'lazy free' mechanism, has the > > > > ability to be 'cancelled' if you write to the memory. Also, after the > > > > freeing is complete, you can write to the memory to reuse it, the mapping > > > > is still there. > > > > > > > > * For hardware poison markers it makes sense to drop them as you're > > > > effectively saying 'I am done with this range that is now unbacked and > > > > expect to get an empty page should I use it now'. UFFD WP I am not sure > > > > about but presumably also fine. > > > > > > > > * However, guard pages are different - if you 'cancel' and you are left > > > > with a block of memory allocated to you by a pthread or userland > > > > allocator implementation, you don't want to then no longer be protected > > > > from overrunning into other thread memory. > > > > > > Agreed. What happens on MADV_DONTNEED/MADV_FREE on guard pages? Ignored or > > > error? It sounds like a usage "error" to me (in contrast to munmap()). > > > > It's ignored, no errror. On MADV_DONTNEED we already left the guard pages in > > place, from v3 we will do the same for MADV_FREE. > > > > I'm not sure I'd say it's an error per se, as somebody might have a use case > > where they want to zap over a range but keep guard pages, perhaps an allocator > > or something? > > Hm, not sure I see use for that. > > Staring at madvise_walk_vmas(), we return ENOMEM on VMA holes, but would > process PROT_NONE. So current behavior is at least consistent with PROT_NONE > handling (where something could be mapped, though). Err, the handling of holes is terrible, yes we return ENOMEM, but we _carry out the whole procedure_ then return an error, an error _indistinguishable from an error arising from any of the individual parts_. Which is just, awful. > > No strong opinion. Well you used up your strong opinion on the naming ;) > > > > > Also the existing logic is that existing markers (HW poison, uffd-simulated HW > > poison, uffd wp marker) are retained and no error raised on MADV_DONTNEED, and > > no error on MADV_FREE either, so it'd be consistent with existing behaviour. > > > HW poison / uffd-simulated HW poison are expected to be zapped: it's just > like a mapped page with HWPOISON. So that is correct. Well, poison is _not_ zapped on MADV_DONTNEED but _is_ on MADV_FREE :) anyway, I mean the MADV flags are a confusing mess generally, as per Vlasta's comments which to begin with I strongly disagreed with then, discussing further, realsed that no this is just a bit insane and had driven _me_ insane. > > UFFD-WP behavior is ... weird. Would not expect MADV_DONTNEED to zap uffd-wp > entries. > > > > > Also semantically you are achieving what the calls expect you are freeing the > > ranges since the guard page regions are unbacked so are already freed... so yeah > > I don't think an error really makes sense here. > > I you compare it to a VMA hole, it make sense to fail. If we treat it like > PROT_NONE, it make sense to skip them. > > > > > We might also be limiting use cases by assuming they might _only_ be used for > > allocators and such. > > I don't buy that as an argument, sorry :) > > "Let's map the kernel writable into all user space because otherwise we > might be limiting use cases" That's a great idea! Patch series incoming, 1st April 2025... :>) > > > :P > > -- > Cheers, > > David / dhildenb > Overall I think just always leaving in place except on remedy err sorry sorry unpoison and munmap and not returning an error if encountered elsewhere (other than, of course, GUP) is the right way forward and most in line with user expectation and practical usage.
On 21.10.24 18:51, Lorenzo Stoakes wrote: > On Mon, Oct 21, 2024 at 06:44:04PM +0200, David Hildenbrand wrote: >> On 21.10.24 18:23, Lorenzo Stoakes wrote: >>> On Mon, Oct 21, 2024 at 06:00:20PM +0200, David Hildenbrand wrote: >>> [snip] >>>>> >>>>> To summarise for on-list: >>>>> >>>>> * MADV_FREE, while ostensibly being a 'lazy free' mechanism, has the >>>>> ability to be 'cancelled' if you write to the memory. Also, after the >>>>> freeing is complete, you can write to the memory to reuse it, the mapping >>>>> is still there. >>>>> >>>>> * For hardware poison markers it makes sense to drop them as you're >>>>> effectively saying 'I am done with this range that is now unbacked and >>>>> expect to get an empty page should I use it now'. UFFD WP I am not sure >>>>> about but presumably also fine. >>>>> >>>>> * However, guard pages are different - if you 'cancel' and you are left >>>>> with a block of memory allocated to you by a pthread or userland >>>>> allocator implementation, you don't want to then no longer be protected >>>>> from overrunning into other thread memory. >>>> >>>> Agreed. What happens on MADV_DONTNEED/MADV_FREE on guard pages? Ignored or >>>> error? It sounds like a usage "error" to me (in contrast to munmap()). >>> >>> It's ignored, no errror. On MADV_DONTNEED we already left the guard pages in >>> place, from v3 we will do the same for MADV_FREE. >>> >>> I'm not sure I'd say it's an error per se, as somebody might have a use case >>> where they want to zap over a range but keep guard pages, perhaps an allocator >>> or something? >> >> Hm, not sure I see use for that. >> >> Staring at madvise_walk_vmas(), we return ENOMEM on VMA holes, but would >> process PROT_NONE. So current behavior is at least consistent with PROT_NONE >> handling (where something could be mapped, though). > > Err, the handling of holes is terrible, yes we return ENOMEM, but we _carry out > the whole procedure_ then return an error, an error _indistinguishable from an > error arising from any of the individual parts_. > > Which is just, awful. Yes, absolutely. I don't know why we decided to continue. And why we return ENOMEM ... > >> >> No strong opinion. > > Well you used up your strong opinion on the naming ;) He, and I am out of energy for this year ;) In retrospective, "install or remove a guard PTE" is just much better than anything else ... So I should never have been mislead to suggest poison/unpoison as a replacement for poison/remedy :P > >> >>> >>> Also the existing logic is that existing markers (HW poison, uffd-simulated HW >>> poison, uffd wp marker) are retained and no error raised on MADV_DONTNEED, and >>> no error on MADV_FREE either, so it'd be consistent with existing behaviour. >> >> >> HW poison / uffd-simulated HW poison are expected to be zapped: it's just >> like a mapped page with HWPOISON. So that is correct. > > Well, poison is _not_ zapped on MADV_DONTNEED but _is_ on MADV_FREE :) anyway, I Huh? madvise_dontneed_single_vma()->zap_page_range_single(details=NULL)->unmap_single_vma(details=NULL) ... zap_pte_range() } else if (is_hwpoison_entry(entry) || is_poisoned_swp_entry(entry)) { if (!should_zap_cows(details)) continue; ... Should just zap them. What am I missing? > mean the MADV flags are a confusing mess generally, as per Vlasta's comments > which to begin with I strongly disagreed with then, discussing further, realsed > that no this is just a bit insane and had driven _me_ insane. > >> >> UFFD-WP behavior is ... weird. Would not expect MADV_DONTNEED to zap uffd-wp >> entries. >> >>> >>> Also semantically you are achieving what the calls expect you are freeing the >>> ranges since the guard page regions are unbacked so are already freed... so yeah >>> I don't think an error really makes sense here. >> >> I you compare it to a VMA hole, it make sense to fail. If we treat it like >> PROT_NONE, it make sense to skip them. >> >>> >>> We might also be limiting use cases by assuming they might _only_ be used for >>> allocators and such. >> >> I don't buy that as an argument, sorry :) >> >> "Let's map the kernel writable into all user space because otherwise we >> might be limiting use cases" > > That's a great idea! Patch series incoming, 1st April 2025... :>) :) Just flip the bit on x86 and we're done! >> >> >> :P >> >> -- >> Cheers, >> >> David / dhildenb >> > > Overall I think just always leaving in place except on remedy err sorry sorry > unpoison and munmap and not returning an error if encountered elsewhere (other > than, of course, GUP) is the right way forward and most in line with user > expectation and practical usage. Fine with me, make sure to document that is behaves like a PROT_NONE VMA, not like a memory hole, except when something would trigger a fault (GUP etc).
On Mon, Oct 21, 2024 at 07:00:53PM +0200, David Hildenbrand wrote: > On 21.10.24 18:51, Lorenzo Stoakes wrote: > > On Mon, Oct 21, 2024 at 06:44:04PM +0200, David Hildenbrand wrote: > > > On 21.10.24 18:23, Lorenzo Stoakes wrote: > > > > On Mon, Oct 21, 2024 at 06:00:20PM +0200, David Hildenbrand wrote: > > > > [snip] > > > > > > > > > > > > To summarise for on-list: > > > > > > > > > > > > * MADV_FREE, while ostensibly being a 'lazy free' mechanism, has the > > > > > > ability to be 'cancelled' if you write to the memory. Also, after the > > > > > > freeing is complete, you can write to the memory to reuse it, the mapping > > > > > > is still there. > > > > > > > > > > > > * For hardware poison markers it makes sense to drop them as you're > > > > > > effectively saying 'I am done with this range that is now unbacked and > > > > > > expect to get an empty page should I use it now'. UFFD WP I am not sure > > > > > > about but presumably also fine. > > > > > > > > > > > > * However, guard pages are different - if you 'cancel' and you are left > > > > > > with a block of memory allocated to you by a pthread or userland > > > > > > allocator implementation, you don't want to then no longer be protected > > > > > > from overrunning into other thread memory. > > > > > > > > > > Agreed. What happens on MADV_DONTNEED/MADV_FREE on guard pages? Ignored or > > > > > error? It sounds like a usage "error" to me (in contrast to munmap()). > > > > > > > > It's ignored, no errror. On MADV_DONTNEED we already left the guard pages in > > > > place, from v3 we will do the same for MADV_FREE. > > > > > > > > I'm not sure I'd say it's an error per se, as somebody might have a use case > > > > where they want to zap over a range but keep guard pages, perhaps an allocator > > > > or something? > > > > > > Hm, not sure I see use for that. > > > > > > Staring at madvise_walk_vmas(), we return ENOMEM on VMA holes, but would > > > process PROT_NONE. So current behavior is at least consistent with PROT_NONE > > > handling (where something could be mapped, though). > > > > Err, the handling of holes is terrible, yes we return ENOMEM, but we _carry out > > the whole procedure_ then return an error, an error _indistinguishable from an > > error arising from any of the individual parts_. > > > > Which is just, awful. > > Yes, absolutely. I don't know why we decided to continue. And why we return > ENOMEM ... Anyway UAPI so no turning back now :) > > > > > > > > > No strong opinion. > > > > Well you used up your strong opinion on the naming ;) > > He, and I am out of energy for this year ;) Haha understandable... > > In retrospective, "install or remove a guard PTE" is just much better than > anything else ... > > So I should never have been mislead to suggest poison/unpoison as a > replacement for poison/remedy :P You know the reason ;) > > > > > > > > > > > > > > Also the existing logic is that existing markers (HW poison, uffd-simulated HW > > > > poison, uffd wp marker) are retained and no error raised on MADV_DONTNEED, and > > > > no error on MADV_FREE either, so it'd be consistent with existing behaviour. > > > > > > > > > HW poison / uffd-simulated HW poison are expected to be zapped: it's just > > > like a mapped page with HWPOISON. So that is correct. > > > > Well, poison is _not_ zapped on MADV_DONTNEED but _is_ on MADV_FREE :) anyway, I > > Huh? > > madvise_dontneed_single_vma()->zap_page_range_single(details=NULL)->unmap_single_vma(details=NULL) > ... zap_pte_range() > > } else if (is_hwpoison_entry(entry) || > is_poisoned_swp_entry(entry)) { > if (!should_zap_cows(details)) > continue; > ... > > Should just zap them. > > What am I missing? Yeah ok it's me who's missing something here, I hadn't noticed details == NULL so should_zap_cows() is true, my mistake! In any case we explicitly add code here to prevent guard pages from going. I will correct everything where I wrongly say otherwise, doh! > > > mean the MADV flags are a confusing mess generally, as per Vlasta's comments > > which to begin with I strongly disagreed with then, discussing further, realsed > > that no this is just a bit insane and had driven _me_ insane. > > > > > > > > UFFD-WP behavior is ... weird. Would not expect MADV_DONTNEED to zap uffd-wp > > > entries. > > > > > > > > > > > Also semantically you are achieving what the calls expect you are freeing the > > > > ranges since the guard page regions are unbacked so are already freed... so yeah > > > > I don't think an error really makes sense here. > > > > > > I you compare it to a VMA hole, it make sense to fail. If we treat it like > > > PROT_NONE, it make sense to skip them. > > > > > > > > > > > We might also be limiting use cases by assuming they might _only_ be used for > > > > allocators and such. > > > > > > I don't buy that as an argument, sorry :) > > > > > > "Let's map the kernel writable into all user space because otherwise we > > > might be limiting use cases" > > > > That's a great idea! Patch series incoming, 1st April 2025... :>) > > :) Just flip the bit on x86 and we're done! ;) > > > > > > > > > > :P > > > > > > -- > > > Cheers, > > > > > > David / dhildenb > > > > > > > Overall I think just always leaving in place except on remedy err sorry sorry > > unpoison and munmap and not returning an error if encountered elsewhere (other > > than, of course, GUP) is the right way forward and most in line with user > > expectation and practical usage. > > > Fine with me, make sure to document that is behaves like a PROT_NONE VMA, > not like a memory hole, except when something would trigger a fault (GUP > etc). Ack will make sure to document. > > > -- > Cheers, > > David / dhildenb >
On 21.10.24 19:26, Vlastimil Babka wrote: > On 10/21/24 19:14, Lorenzo Stoakes wrote: >> On Mon, Oct 21, 2024 at 07:00:53PM +0200, David Hildenbrand wrote: >>> >>>> >>>>> >>>>>> >>>>>> Also the existing logic is that existing markers (HW poison, uffd-simulated HW >>>>>> poison, uffd wp marker) are retained and no error raised on MADV_DONTNEED, and >>>>>> no error on MADV_FREE either, so it'd be consistent with existing behaviour. >>>>> >>>>> >>>>> HW poison / uffd-simulated HW poison are expected to be zapped: it's just >>>>> like a mapped page with HWPOISON. So that is correct. >>>> >>>> Well, poison is _not_ zapped on MADV_DONTNEED but _is_ on MADV_FREE :) anyway, I >>> >>> Huh? >>> >>> madvise_dontneed_single_vma()->zap_page_range_single(details=NULL)->unmap_single_vma(details=NULL) >>> ... zap_pte_range() >>> >>> } else if (is_hwpoison_entry(entry) || >>> is_poisoned_swp_entry(entry)) { >>> if (!should_zap_cows(details)) >>> continue; >>> ... >>> >>> Should just zap them. >>> >>> What am I missing? >> >> Yeah ok it's me who's missing something here, I hadn't noticed details == NULL >> so should_zap_cows() is true, my mistake! > > Well, good to know it's consistent then. As I've explained I see why zapping > actual hwpoison makes sense for MADV_DONTNEED/MADV_FREE. That it's done also > for uffd poison is not completely clear, but maybe it was just easier to > implement. Note that in VM context "uffd poison" really just is "this was hwpoison on the source VM, so we mimic that on the destination VM, because the data *is* lost" -- so you want the exact same behavior. For example, when a VM reboots you might just want to ZAP these hwpoison entries, and get fresh pages on next access. So to me it makes sense that they are treated equally.