mbox series

[v3,0/5] implement lightweight guard pages

Message ID cover.1729699916.git.lorenzo.stoakes@oracle.com
Headers show
Series implement lightweight guard pages | expand

Message

Lorenzo Stoakes Oct. 23, 2024, 4:24 p.m. UTC
Userland library functions such as allocators and threading implementations
often require regions of memory to act as 'guard pages' - mappings which,
when accessed, result in a fatal signal being sent to the accessing
process.

The current means by which these are implemented is via a PROT_NONE mmap()
mapping, which provides the required semantics however incur an overhead of
a VMA for each such region.

With a great many processes and threads, this can rapidly add up and incur
a significant memory penalty. It also has the added problem of preventing
merges that might otherwise be permitted.

This series takes a different approach - an idea suggested by Vlasimil
Babka (and before him David Hildenbrand and Jann Horn - perhaps more - the
provenance becomes a little tricky to ascertain after this - please forgive
any omissions!)  - rather than locating the guard pages at the VMA layer,
instead placing them in page tables mapping the required ranges.

Early testing of the prototype version of this code suggests a 5 times
speed up in memory mapping invocations (in conjunction with use of
process_madvise()) and a 13% reduction in VMAs on an entirely idle android
system and unoptimised code.

We expect with optimisation and a loaded system with a larger number of
guard pages this could significantly increase, but in any case these
numbers are encouraging.

This way, rather than having separate VMAs specifying which parts of a
range are guard pages, instead we have a VMA spanning the entire range of
memory a user is permitted to access and including ranges which are to be
'guarded'.

After mapping this, a user can specify which parts of the range should
result in a fatal signal when accessed.

By restricting the ability to specify guard pages to memory mapped by
existing VMAs, we can rely on the mappings being torn down when the
mappings are ultimately unmapped and everything works simply as if the
memory were not faulted in, from the point of view of the containing VMAs.

This mechanism in effect poisons memory ranges similar to hardware memory
poisoning, only it is an entirely software-controlled form of poisoning.

The mechanism is implemented via madvise() behaviour - MADV_GUARD_INSTALL
which installs page table-level guard page markers - and
MADV_GUARD_REMOVE - which clears them.

Guard markers can be installed across multiple VMAs and any existing
mappings will be cleared, that is zapped, before installing the guard page
markers in the page tables.

There is no concept of 'nested' guard markers, multiple attempts to install
guard markers in a range will, after the first attempt, have no effect.

Importantly, removing guard markers over a range that contains both guard
markers and ordinary backed memory has no effect on anything but the guard
markers (including leaving huge pages un-split), so a user can safely
remove guard markers over a range of memory leaving the rest intact.

The actual mechanism by which the page table entries are specified makes
use of existing logic - PTE markers, which are used for the userfaultfd
UFFDIO_POISON mechanism.

Unfortunately PTE_MARKER_POISONED is not suited for the guard page
mechanism as it results in VM_FAULT_HWPOISON semantics in the fault
handler, so we add our own specific PTE_MARKER_GUARD and adapt existing
logic to handle it.

We also extend the generic page walk mechanism to allow for installation of
PTEs (carefully restricted to memory management logic only to prevent
unwanted abuse).

We ensure that zapping performed by MADV_DONTNEED and MADV_FREE do not
remove guard markers, nor does forking (except when VM_WIPEONFORK is
specified for a VMA which implies a total removal of memory
characteristics).

It's important to note that the guard page implementation is emphatically
NOT a security feature, so a user can remove the markers if they wish. We
simply implement it in such a way as to provide the least surprising
behaviour.

An extensive set of self-tests are provided which ensure behaviour is as
expected and additionally self-documents expected behaviour of guard
ranges.

Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Suggested-by: Jann Horn <jannh@google.com>
Suggested-by: David Hildenbrand <david@redhat.com>

v3
* Cleaned up mm/pagewalk.c logic a bit to make things clearer, as suggested
  by Vlastiml.
* Explicitly avoid splitting THP on PTE installation, as suggested by
  Vlastimil. Note this has no impact on the guard pages logic, which has
  page table entry handlers at PUD, PMD and PTE level.
* Added WARN_ON_ONCE() to mm/hugetlb.c path where we don't expect a guard
  marker, as suggested by Vlastimil.
* Reverted change to is_poisoned_swp_entry() to exclude guard pages which
  has the effect of MADV_FREE _not_ clearing guard pages. After discussion
  with Vlastimil, it became apparent that the ability to 'cancel' the
  freeing operation by writing to the mapping after having issued an
  MADV_FREE would mean that we would risk unexpected behaviour should the
  guard pages be removed, so we now do not remove markers here at all.
* Added comment to PTE_MARKER_GUARD to highlight that memory tagged with
  the marker behaves as if it were a region mapped PROT_NONE, as
  highlighted by David.
* Rename poison -> install, unpoison -> remove (i.e. MADV_GUARD_INSTALL /
  MADV_GUARD_REMOVE over MADV_GUARD_POISON / MADV_GUARD_REMOVE) at the
  request of David and John who both find the poison analogy
  confusing/overloaded.
* After a lot of discussion, replace the looping behaviour should page
  faults race with guard page installation with a modest reattempt followed
  by returning -ERESTARTNOINTR to have the operation abort and re-enter,
  relieving lock contention and avoiding the possibility of allowing a
  malicious sandboxed process to impact the mmap lock or stall the overall
  process more than necessary, as suggested by Jann and Vlastimil having
  raised the issue.
* Adjusted the page table walker so a populated huge PUD or PMD is
  correctly treated as being populated, necessitating a zap. In v2 we
  incorrectly skipped over these, which would cause the logic to wrongly
  proceed as if nothing were populated and the install succeeded.
  Instead, explicitly check to see if a huge page - if so, do not split but
  rather abort the operation and let zap take care of things.
* Updated the guard remove logic to not unnecessarily split huge pages
  either.
* Added a debug check to assert that the number of installed PTEs matches
  expectation, accounting for any existing guard pages.
* Adapted vector_madvise() used by the process_madvise() system call to
  handle -ERESTARTNOINTR correctly.

v2
* The macros in kselftest_harness.h seem to be broken - __EXPECT() is
  terminated by '} while (0); OPTIONAL_HANDLER(_assert)' meaning it is not
  safe in single line if / else or for /which blocks, however working
  around this results in checkpatch producing invalid warnings, as reported
  by Shuah.
* Fixing these macros is out of scope for this series, so compromise and
  instead rewrite test blocks so as to use multiple lines by separating out
  a decl in most cases. This has the side effect of, for the most part,
  making things more readable.
* Heavily document the use of the volatile keyword - we can't avoid
  checkpatch complaining about this, so we explain it, as reported by
  Shuah.
* Updated commit message to highlight that we skip tests we lack
  permissions for, as reported by Shuah.
* Replaced a perror() with ksft_exit_fail_perror(), as reported by Shuah.
* Added user friendly messages to cases where tests are skipped due to lack
  of permissions, as reported by Shuah.
* Update the tool header to include the new MADV_GUARD_POISON/UNPOISON
  defines and directly include asm-generic/mman.h to get the
  platform-neutral versions to ensure we import them.
* Finally fixed Vlastimil's email address in Suggested-by tags from suze to
  suse, as reported by Vlastimil.
* Added linux-api to cc list, as reported by Vlastimil.
https://lore.kernel.org/all/cover.1729440856.git.lorenzo.stoakes@oracle.com/

v1
* Un-RFC'd as appears no major objections to approach but rather debate on
  implementation.
* Fixed issue with arches which need mmu_context.h and
  tlbfush.h. header imports in pagewalker logic to be able to use
  update_mmu_cache() as reported by the kernel test bot.
* Added comments in page walker logic to clarify who can use
  ops->install_pte and why as well as adding a check_ops_valid() helper
  function, as suggested by Christoph.
* Pass false in full parameter in pte_clear_not_present_full() as suggested
  by Jann.
* Stopped erroneously requiring a write lock for the poison operation as
  suggested by Jann and Suren.
* Moved anon_vma_prepare() to the start of madvise_guard_poison() to be
  consistent with how this is used elsewhere in the kernel as suggested by
  Jann.
* Avoid returning -EAGAIN if we are raced on page faults, just keep looping
  and duck out if a fatal signal is pending or a conditional reschedule is
  needed, as suggested by Jann.
* Avoid needlessly splitting huge PUDs and PMDs by specifying
  ACTION_CONTINUE, as suggested by Jann.
https://lore.kernel.org/all/cover.1729196871.git.lorenzo.stoakes@oracle.com/

RFC
https://lore.kernel.org/all/cover.1727440966.git.lorenzo.stoakes@oracle.com/

Lorenzo Stoakes (5):
  mm: pagewalk: add the ability to install PTEs
  mm: add PTE_MARKER_GUARD PTE marker
  mm: madvise: implement lightweight guard page mechanism
  tools: testing: update tools UAPI header for mman-common.h
  selftests/mm: add self tests for guard page feature

 arch/alpha/include/uapi/asm/mman.h           |    3 +
 arch/mips/include/uapi/asm/mman.h            |    3 +
 arch/parisc/include/uapi/asm/mman.h          |    3 +
 arch/xtensa/include/uapi/asm/mman.h          |    3 +
 include/linux/mm_inline.h                    |    2 +-
 include/linux/pagewalk.h                     |   18 +-
 include/linux/swapops.h                      |   24 +-
 include/uapi/asm-generic/mman-common.h       |    3 +
 mm/hugetlb.c                                 |    4 +
 mm/internal.h                                |   12 +
 mm/madvise.c                                 |  225 ++++
 mm/memory.c                                  |   18 +-
 mm/mprotect.c                                |    6 +-
 mm/mseal.c                                   |    1 +
 mm/pagewalk.c                                |  227 +++-
 tools/include/uapi/asm-generic/mman-common.h |    3 +
 tools/testing/selftests/mm/.gitignore        |    1 +
 tools/testing/selftests/mm/Makefile          |    1 +
 tools/testing/selftests/mm/guard-pages.c     | 1239 ++++++++++++++++++
 19 files changed, 1720 insertions(+), 76 deletions(-)
 create mode 100644 tools/testing/selftests/mm/guard-pages.c

--
2.47.0

Comments

Andrew Morton Oct. 23, 2024, 11:04 p.m. UTC | #1
On Wed, 23 Oct 2024 17:24:38 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

> 
> ...
>
> Existing mechanism for performing a walk which also installs page table
> entries if necessary are heavily duplicated throughout the kernel,

How complicated is it to migrate those to use this?

> ...
>
> We explicitly do not allow the existing page walk API to expose this
> feature as it is dangerous and intended for internal mm use only. Therefore
> we provide a new walk_page_range_mm() function exposed only to
> mm/internal.h.
>

Is this decision compatible with the above migrations?
Lorenzo Stoakes Oct. 24, 2024, 7:34 a.m. UTC | #2
On Wed, Oct 23, 2024 at 04:04:05PM -0700, Andrew Morton wrote:
> On Wed, 23 Oct 2024 17:24:38 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> >
> > ...
> >
> > Existing mechanism for performing a walk which also installs page table
> > entries if necessary are heavily duplicated throughout the kernel,
>
> How complicated is it to migrate those to use this?

I would say probably somewhat difficult as very often people are doing quite
custom things, but I will take a look at seeing if we can't make things a little
more generic.

I am also mildly motivated to look at trying to find a generic way to do
replaces...

Both on the TODO!

>
> > ...
> >
> > We explicitly do not allow the existing page walk API to expose this
> > feature as it is dangerous and intended for internal mm use only. Therefore
> > we provide a new walk_page_range_mm() function exposed only to
> > mm/internal.h.
> >
>
> Is this decision compatible with the above migrations?

Yes, nobody but mm should be doing this. You have to be enormously careful about
locks, caches, accounting and the various other machinery around actually
setting a PTE, and this allows you to establish entirely new mappings, so could
very much be abused.
Lorenzo Stoakes Oct. 24, 2024, 8:07 a.m. UTC | #3
On Thu, Oct 24, 2024 at 09:45:37AM +0200, David Hildenbrand wrote:
> On 24.10.24 09:34, Lorenzo Stoakes wrote:
> > On Wed, Oct 23, 2024 at 04:04:05PM -0700, Andrew Morton wrote:
> > > On Wed, 23 Oct 2024 17:24:38 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > >
> > > > ...
> > > >
> > > > Existing mechanism for performing a walk which also installs page table
> > > > entries if necessary are heavily duplicated throughout the kernel,
> > >
> > > How complicated is it to migrate those to use this?
> >
> > I would say probably somewhat difficult as very often people are doing quite
> > custom things, but I will take a look at seeing if we can't make things a little
> > more generic.
> >
> > I am also mildly motivated to look at trying to find a generic way to do
> > replaces...
> >
> > Both on the TODO!
>
> I'm not super happy about extending the rusty old pagewalk API, because it's
> inefficient (indirect calls) and not future proof (batching, large folios).

Yeah it could be improved, but I think the ideal way would be to genericise as
much as we can and 'upgrade' this logic.

>
> But I see how we ended up with this patch, and it will be easy to convert to
> something better once we have it.

Yes, I was quite happy with what an ultimately small delta this all ended up
being vs. the various other alternatives (change zap logic, introduce custom
page fault mechanism, duplicating page walk code _yet again_, porting uffd
logic, etc. etc.)

But in an ideal world we'd have _one_ place that does this.

>
> We already discussed in the past that we need a better and more efficient
> way to walk page tables. I have part of that on my TODO list, but I'm
> getting distracted.

Yes I remember an LSF session on this, it's a really obvious area of improvement
that stands out at the moment for sure.

Having worked several 12+ hour days in a row though recently I can relate to
workload making this difficult though :)

>
> *Inserting* (not walking/modifying existing things as most users to) as done
> in this patch is slightly different though, likely "on thing that fits all"
> will not apply to all page table walker user cases.

Yeah, there's also replace scenarios which then have to do egregious amounts of
work to make sure we do everything right, in fact there's duplicates of this in
mm/madvise.c *grumble grumble*.

>
> --
> Cheers,
>
> David / dhildenb
>

OK so I guess I'll hold off my TODOs on this as you are looking in this area and
I trust you :)

Cheers!
Vlastimil Babka Oct. 25, 2024, 6:13 p.m. UTC | #4
On 10/23/24 18:24, Lorenzo Stoakes wrote:
> The existing generic pagewalk logic permits the walking of page tables,
> invoking callbacks at individual page table levels via user-provided
> mm_walk_ops callbacks.
> 
> This is useful for traversing existing page table entries, but precludes
> the ability to establish new ones.
> 
> Existing mechanism for performing a walk which also installs page table
> entries if necessary are heavily duplicated throughout the kernel, each
> with semantic differences from one another and largely unavailable for use
> elsewhere.
> 
> Rather than add yet another implementation, we extend the generic pagewalk
> logic to enable the installation of page table entries by adding a new
> install_pte() callback in mm_walk_ops. If this is specified, then upon
> encountering a missing page table entry, we allocate and install a new one
> and continue the traversal.
> 
> If a THP huge page is encountered at either the PMD or PUD level we split
> it only if there are ops->pte_entry() (or ops->pmd_entry at PUD level),
> otherwise if there is only an ops->install_pte(), we avoid the unnecessary
> split.
> 
> We do not support hugetlb at this stage.
> 
> If this function returns an error, or an allocation fails during the
> operation, we abort the operation altogether. It is up to the caller to
> deal appropriately with partially populated page table ranges.
> 
> If install_pte() is defined, the semantics of pte_entry() change - this
> callback is then only invoked if the entry already exists. This is a useful
> property, as it allows a caller to handle existing PTEs while installing
> new ones where necessary in the specified range.
> 
> If install_pte() is not defined, then there is no functional difference to
> this patch, so all existing logic will work precisely as it did before.
> 
> As we only permit the installation of PTEs where a mapping does not already
> exist there is no need for TLB management, however we do invoke
> update_mmu_cache() for architectures which require manual maintenance of
> mappings for other CPUs.
> 
> We explicitly do not allow the existing page walk API to expose this
> feature as it is dangerous and intended for internal mm use only. Therefore
> we provide a new walk_page_range_mm() function exposed only to
> mm/internal.h.
> 
> Reviewed-by: Jann Horn <jannh@google.com>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Just a small subjective suggestion in case you agree and there's a respin or
followups:

> @@ -109,18 +131,19 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>  
>  		if (walk->action == ACTION_AGAIN)
>  			goto again;
> -
> -		/*
> -		 * Check this here so we only break down trans_huge
> -		 * pages when we _need_ to
> -		 */
> -		if ((!walk->vma && (pmd_leaf(*pmd) || !pmd_present(*pmd))) ||
> -		    walk->action == ACTION_CONTINUE ||
> -		    !(ops->pte_entry))
> +		if (walk->action == ACTION_CONTINUE)
>  			continue;
> +		if (!ops->install_pte && !ops->pte_entry)
> +			continue; /* Nothing to do. */
> +		if (!ops->pte_entry && ops->install_pte &&
> +		    pmd_present(*pmd) &&
> +		    (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
> +			continue; /* Avoid unnecessary split. */

Much better now, thanks, but maybe the last 2 parts could be:

if (!ops->pte_entry) {
	if (!ops->install_pte)
		continue; /* Nothing to do. */
	else if (pmd_present(*pmd)
		 && (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
		continue; /* Avoid unnecessary split. */
}

Or at least put !ops->pte_entry first in both conditions?

>  		if (walk->vma)
>  			split_huge_pmd(walk->vma, pmd, addr);
> +		else if (pmd_leaf(*pmd) || !pmd_present(*pmd))
> +			continue; /* Nothing to do. */
>  
>  		err = walk_pte_range(pmd, addr, next, walk);
>  		if (err)
> @@ -148,11 +171,14 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
>   again:
>  		next = pud_addr_end(addr, end);
>  		if (pud_none(*pud)) {
> -			if (ops->pte_hole)
> +			if (ops->install_pte)
> +				err = __pmd_alloc(walk->mm, pud, addr);
> +			else if (ops->pte_hole)
>  				err = ops->pte_hole(addr, next, depth, walk);
>  			if (err)
>  				break;
> -			continue;
> +			if (!ops->install_pte)
> +				continue;
>  		}
>  
>  		walk->action = ACTION_SUBTREE;
> @@ -164,14 +190,20 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
>  
>  		if (walk->action == ACTION_AGAIN)
>  			goto again;
> -
> -		if ((!walk->vma && (pud_leaf(*pud) || !pud_present(*pud))) ||
> -		    walk->action == ACTION_CONTINUE ||
> -		    !(ops->pmd_entry || ops->pte_entry))
> +		if (walk->action == ACTION_CONTINUE)
>  			continue;
> +		if (!ops->install_pte && !ops->pte_entry && !ops->pmd_entry)
> +			continue;  /* Nothing to do. */
> +		if (!ops->pmd_entry && !ops->pte_entry && ops->install_pte &&
> +		    pud_present(*pud) &&
> +		    (pud_trans_huge(*pud) || pud_devmap(*pud)))
> +			continue; /* Avoid unnecessary split. */

Ditto.

Thanks!
David Hildenbrand Oct. 25, 2024, 7:08 p.m. UTC | #5
>>
>> We already discussed in the past that we need a better and more efficient
>> way to walk page tables. I have part of that on my TODO list, but I'm
>> getting distracted.
> 
> Yes I remember an LSF session on this, it's a really obvious area of improvement
> that stands out at the moment for sure.
> 
> Having worked several 12+ hour days in a row though recently I can relate to
> workload making this difficult though :)

Yes :)

> 
>>
>> *Inserting* (not walking/modifying existing things as most users to) as done
>> in this patch is slightly different though, likely "on thing that fits all"
>> will not apply to all page table walker user cases.
> 
> Yeah, there's also replace scenarios which then have to do egregious amounts of
> work to make sure we do everything right, in fact there's duplicates of this in
> mm/madvise.c *grumble grumble*.
> 
>>
>> --
>> Cheers,
>>
>> David / dhildenb
>>
> 
> OK so I guess I'll hold off my TODOs on this as you are looking in this area and
> I trust you :)

It will probably take me a while until I get to it, though. I'd focus on 
walking (and batching what we can) first, then on top modifying existing 
entries.

The "install something where there is nothing yet" (incl. populating 
fresh page tables etc.) case probably deserves a separate "walker".

If you end up having spare cycles and want to sync on a possible design 
for some part of that bigger task -- removing the old pagewalk logic -- 
please do reach out! :)
Lorenzo Stoakes Oct. 25, 2024, 9:58 p.m. UTC | #6
On Fri, Oct 25, 2024 at 08:13:26PM +0200, Vlastimil Babka wrote:
> On 10/23/24 18:24, Lorenzo Stoakes wrote:
> > The existing generic pagewalk logic permits the walking of page tables,
> > invoking callbacks at individual page table levels via user-provided
> > mm_walk_ops callbacks.
> >
> > This is useful for traversing existing page table entries, but precludes
> > the ability to establish new ones.
> >
> > Existing mechanism for performing a walk which also installs page table
> > entries if necessary are heavily duplicated throughout the kernel, each
> > with semantic differences from one another and largely unavailable for use
> > elsewhere.
> >
> > Rather than add yet another implementation, we extend the generic pagewalk
> > logic to enable the installation of page table entries by adding a new
> > install_pte() callback in mm_walk_ops. If this is specified, then upon
> > encountering a missing page table entry, we allocate and install a new one
> > and continue the traversal.
> >
> > If a THP huge page is encountered at either the PMD or PUD level we split
> > it only if there are ops->pte_entry() (or ops->pmd_entry at PUD level),
> > otherwise if there is only an ops->install_pte(), we avoid the unnecessary
> > split.
> >
> > We do not support hugetlb at this stage.
> >
> > If this function returns an error, or an allocation fails during the
> > operation, we abort the operation altogether. It is up to the caller to
> > deal appropriately with partially populated page table ranges.
> >
> > If install_pte() is defined, the semantics of pte_entry() change - this
> > callback is then only invoked if the entry already exists. This is a useful
> > property, as it allows a caller to handle existing PTEs while installing
> > new ones where necessary in the specified range.
> >
> > If install_pte() is not defined, then there is no functional difference to
> > this patch, so all existing logic will work precisely as it did before.
> >
> > As we only permit the installation of PTEs where a mapping does not already
> > exist there is no need for TLB management, however we do invoke
> > update_mmu_cache() for architectures which require manual maintenance of
> > mappings for other CPUs.
> >
> > We explicitly do not allow the existing page walk API to expose this
> > feature as it is dangerous and intended for internal mm use only. Therefore
> > we provide a new walk_page_range_mm() function exposed only to
> > mm/internal.h.
> >
> > Reviewed-by: Jann Horn <jannh@google.com>
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Thanks!

>
> Just a small subjective suggestion in case you agree and there's a respin or
> followups:
>
> > @@ -109,18 +131,19 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> >
> >  		if (walk->action == ACTION_AGAIN)
> >  			goto again;
> > -
> > -		/*
> > -		 * Check this here so we only break down trans_huge
> > -		 * pages when we _need_ to
> > -		 */
> > -		if ((!walk->vma && (pmd_leaf(*pmd) || !pmd_present(*pmd))) ||
> > -		    walk->action == ACTION_CONTINUE ||
> > -		    !(ops->pte_entry))
> > +		if (walk->action == ACTION_CONTINUE)
> >  			continue;
> > +		if (!ops->install_pte && !ops->pte_entry)
> > +			continue; /* Nothing to do. */
> > +		if (!ops->pte_entry && ops->install_pte &&
> > +		    pmd_present(*pmd) &&
> > +		    (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
> > +			continue; /* Avoid unnecessary split. */
>
> Much better now, thanks, but maybe the last 2 parts could be:
>
> if (!ops->pte_entry) {
> 	if (!ops->install_pte)
> 		continue; /* Nothing to do. */
> 	else if (pmd_present(*pmd)
> 		 && (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
> 		continue; /* Avoid unnecessary split. */
> }

I quite liked separating out the 'nothing to do' vs. the 'unnecessary split'
cases, but I agree it makes it harder to see that the 2nd case is an 'install
pte ONLY' case.

Yeah so I think your version is better, but maybe we can find a way to be more
expressive somehow... if we could declare vars mid-way thhrough it'd be easier
:P

Will improve on respin if it comes up

>
> Or at least put !ops->pte_entry first in both conditions?

Ack yeah that'd be better!

>
> >  		if (walk->vma)
> >  			split_huge_pmd(walk->vma, pmd, addr);
> > +		else if (pmd_leaf(*pmd) || !pmd_present(*pmd))
> > +			continue; /* Nothing to do. */
> >
> >  		err = walk_pte_range(pmd, addr, next, walk);
> >  		if (err)
> > @@ -148,11 +171,14 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
> >   again:
> >  		next = pud_addr_end(addr, end);
> >  		if (pud_none(*pud)) {
> > -			if (ops->pte_hole)
> > +			if (ops->install_pte)
> > +				err = __pmd_alloc(walk->mm, pud, addr);
> > +			else if (ops->pte_hole)
> >  				err = ops->pte_hole(addr, next, depth, walk);
> >  			if (err)
> >  				break;
> > -			continue;
> > +			if (!ops->install_pte)
> > +				continue;
> >  		}
> >
> >  		walk->action = ACTION_SUBTREE;
> > @@ -164,14 +190,20 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
> >
> >  		if (walk->action == ACTION_AGAIN)
> >  			goto again;
> > -
> > -		if ((!walk->vma && (pud_leaf(*pud) || !pud_present(*pud))) ||
> > -		    walk->action == ACTION_CONTINUE ||
> > -		    !(ops->pmd_entry || ops->pte_entry))
> > +		if (walk->action == ACTION_CONTINUE)
> >  			continue;
> > +		if (!ops->install_pte && !ops->pte_entry && !ops->pmd_entry)
> > +			continue;  /* Nothing to do. */
> > +		if (!ops->pmd_entry && !ops->pte_entry && ops->install_pte &&
> > +		    pud_present(*pud) &&
> > +		    (pud_trans_huge(*pud) || pud_devmap(*pud)))
> > +			continue; /* Avoid unnecessary split. */
>
> Ditto.

Ack!

>
> Thanks!
>

Cheers!