Message ID: cover.1729699916.git.lorenzo.stoakes@oracle.com
Series: implement lightweight guard pages
On Wed, 23 Oct 2024 17:24:38 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

> ...
>
> Existing mechanisms for performing a walk which also installs page table
> entries if necessary are heavily duplicated throughout the kernel,

How complicated is it to migrate those to use this?

> ...
>
> We explicitly do not allow the existing page walk API to expose this
> feature as it is dangerous and intended for internal mm use only. Therefore
> we provide a new walk_page_range_mm() function exposed only to
> mm/internal.h.

Is this decision compatible with the above migrations?
On Wed, Oct 23, 2024 at 04:04:05PM -0700, Andrew Morton wrote:
> On Wed, 23 Oct 2024 17:24:38 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> > ...
> >
> > Existing mechanisms for performing a walk which also installs page table
> > entries if necessary are heavily duplicated throughout the kernel,
>
> How complicated is it to migrate those to use this?

I would say probably somewhat difficult, as very often people are doing quite
custom things, but I will take a look at seeing if we can't make things a
little more generic.

I am also mildly motivated to look at trying to find a generic way to do
replaces...

Both on the TODO!

> > ...
> >
> > We explicitly do not allow the existing page walk API to expose this
> > feature as it is dangerous and intended for internal mm use only. Therefore
> > we provide a new walk_page_range_mm() function exposed only to
> > mm/internal.h.
>
> Is this decision compatible with the above migrations?

Yes, nobody but mm should be doing this.

You have to be enormously careful about locks, caches, accounting and the
various other machinery around actually setting a PTE, and this allows you to
establish entirely new mappings, so could very much be abused.
On Thu, Oct 24, 2024 at 09:45:37AM +0200, David Hildenbrand wrote:
> On 24.10.24 09:34, Lorenzo Stoakes wrote:
> > On Wed, Oct 23, 2024 at 04:04:05PM -0700, Andrew Morton wrote:
> > > On Wed, 23 Oct 2024 17:24:38 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > > ...
> > > >
> > > > Existing mechanisms for performing a walk which also installs page table
> > > > entries if necessary are heavily duplicated throughout the kernel,
> > >
> > > How complicated is it to migrate those to use this?
> >
> > I would say probably somewhat difficult, as very often people are doing quite
> > custom things, but I will take a look at seeing if we can't make things a
> > little more generic.
> >
> > I am also mildly motivated to look at trying to find a generic way to do
> > replaces...
> >
> > Both on the TODO!
>
> I'm not super happy about extending the rusty old pagewalk API, because it's
> inefficient (indirect calls) and not future proof (batching, large folios).

Yeah, it could be improved, but I think the ideal way would be to genericise
as much as we can and 'upgrade' this logic.

> But I see how we ended up with this patch, and it will be easy to convert to
> something better once we have it.

Yes, I was quite happy with what an ultimately small delta this all ended up
being vs. the various other alternatives (changing zap logic, introducing a
custom page fault mechanism, duplicating page walk code _yet again_, porting
uffd logic, etc. etc.)

But in an ideal world we'd have _one_ place that does this.

> We already discussed in the past that we need a better and more efficient
> way to walk page tables. I have part of that on my TODO list, but I'm
> getting distracted.

Yes, I remember an LSF session on this; it's a really obvious area of
improvement that stands out at the moment for sure.

Having worked several 12+ hour days in a row recently, I can relate to
workload making this difficult though :)

> *Inserting* (not walking/modifying existing things as most users do) as done
> in this patch is slightly different though; likely "one thing that fits all"
> will not apply to all page table walker use cases.

Yeah, there are also replace scenarios, which then have to do egregious
amounts of work to make sure we do everything right; in fact there are
duplicates of this in mm/madvise.c *grumble grumble*.

> --
> Cheers,
>
> David / dhildenb

OK, so I guess I'll hold off my TODOs on this as you are looking in this area
and I trust you :)

Cheers!
On 10/23/24 18:24, Lorenzo Stoakes wrote:
> The existing generic pagewalk logic permits the walking of page tables,
> invoking callbacks at individual page table levels via user-provided
> mm_walk_ops callbacks.
>
> This is useful for traversing existing page table entries, but precludes
> the ability to establish new ones.
>
> Existing mechanisms for performing a walk which also installs page table
> entries if necessary are heavily duplicated throughout the kernel, each
> with semantic differences from one another and largely unavailable for use
> elsewhere.
>
> Rather than add yet another implementation, we extend the generic pagewalk
> logic to enable the installation of page table entries by adding a new
> install_pte() callback in mm_walk_ops. If this is specified, then upon
> encountering a missing page table entry, we allocate and install a new one
> and continue the traversal.
>
> If a THP huge page is encountered at either the PMD or PUD level we split
> it only if there are ops->pte_entry() (or ops->pmd_entry at PUD level),
> otherwise if there is only an ops->install_pte(), we avoid the unnecessary
> split.
>
> We do not support hugetlb at this stage.
>
> If this function returns an error, or an allocation fails during the
> operation, we abort the operation altogether. It is up to the caller to
> deal appropriately with partially populated page table ranges.
>
> If install_pte() is defined, the semantics of pte_entry() change - this
> callback is then only invoked if the entry already exists. This is a useful
> property, as it allows a caller to handle existing PTEs while installing
> new ones where necessary in the specified range.
>
> If install_pte() is not defined, then there is no functional difference to
> this patch, so all existing logic will work precisely as it did before.
>
> As we only permit the installation of PTEs where a mapping does not already
> exist there is no need for TLB management, however we do invoke
> update_mmu_cache() for architectures which require manual maintenance of
> mappings for other CPUs.
>
> We explicitly do not allow the existing page walk API to expose this
> feature as it is dangerous and intended for internal mm use only. Therefore
> we provide a new walk_page_range_mm() function exposed only to
> mm/internal.h.
>
> Reviewed-by: Jann Horn <jannh@google.com>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Just a small subjective suggestion in case you agree and there's a respin or
followups:

> @@ -109,18 +131,19 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>
>  		if (walk->action == ACTION_AGAIN)
>  			goto again;
> -
> -		/*
> -		 * Check this here so we only break down trans_huge
> -		 * pages when we _need_ to
> -		 */
> -		if ((!walk->vma && (pmd_leaf(*pmd) || !pmd_present(*pmd))) ||
> -		    walk->action == ACTION_CONTINUE ||
> -		    !(ops->pte_entry))
> +		if (walk->action == ACTION_CONTINUE)
>  			continue;
> +		if (!ops->install_pte && !ops->pte_entry)
> +			continue; /* Nothing to do. */
> +		if (!ops->pte_entry && ops->install_pte &&
> +		    pmd_present(*pmd) &&
> +		    (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
> +			continue; /* Avoid unnecessary split. */

Much better now, thanks, but maybe the last 2 parts could be:

	if (!ops->pte_entry) {
		if (!ops->install_pte)
			continue; /* Nothing to do. */
		else if (pmd_present(*pmd) &&
			 (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
			continue; /* Avoid unnecessary split. */
	}

Or at least put !ops->pte_entry first in both conditions?

>  		if (walk->vma)
>  			split_huge_pmd(walk->vma, pmd, addr);
> +		else if (pmd_leaf(*pmd) || !pmd_present(*pmd))
> +			continue; /* Nothing to do. */
>
>  		err = walk_pte_range(pmd, addr, next, walk);
>  		if (err)
> @@ -148,11 +171,14 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
>   again:
>  		next = pud_addr_end(addr, end);
>  		if (pud_none(*pud)) {
> -			if (ops->pte_hole)
> +			if (ops->install_pte)
> +				err = __pmd_alloc(walk->mm, pud, addr);
> +			else if (ops->pte_hole)
>  				err = ops->pte_hole(addr, next, depth, walk);
>  			if (err)
>  				break;
> -			continue;
> +			if (!ops->install_pte)
> +				continue;
>  		}
>
>  		walk->action = ACTION_SUBTREE;
> @@ -164,14 +190,20 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
>
>  		if (walk->action == ACTION_AGAIN)
>  			goto again;
> -
> -		if ((!walk->vma && (pud_leaf(*pud) || !pud_present(*pud))) ||
> -		    walk->action == ACTION_CONTINUE ||
> -		    !(ops->pmd_entry || ops->pte_entry))
> +		if (walk->action == ACTION_CONTINUE)
>  			continue;
> +		if (!ops->install_pte && !ops->pte_entry && !ops->pmd_entry)
> +			continue; /* Nothing to do. */
> +		if (!ops->pmd_entry && !ops->pte_entry && ops->install_pte &&
> +		    pud_present(*pud) &&
> +		    (pud_trans_huge(*pud) || pud_devmap(*pud)))
> +			continue; /* Avoid unnecessary split. */

Ditto.

Thanks!
>> We already discussed in the past that we need a better and more efficient
>> way to walk page tables. I have part of that on my TODO list, but I'm
>> getting distracted.
>
> Yes, I remember an LSF session on this; it's a really obvious area of
> improvement that stands out at the moment for sure.
>
> Having worked several 12+ hour days in a row recently, I can relate to
> workload making this difficult though :)

Yes :)

>> *Inserting* (not walking/modifying existing things as most users do) as done
>> in this patch is slightly different though; likely "one thing that fits all"
>> will not apply to all page table walker use cases.
>
> Yeah, there are also replace scenarios, which then have to do egregious
> amounts of work to make sure we do everything right; in fact there are
> duplicates of this in mm/madvise.c *grumble grumble*.
>
>> --
>> Cheers,
>>
>> David / dhildenb
>
> OK, so I guess I'll hold off my TODOs on this as you are looking in this area
> and I trust you :)

It will probably take me a while until I get to it, though. I'd focus on
walking (and batching what we can) first, then on top modifying existing
entries. The "install something where there is nothing yet" (incl. populating
fresh page tables etc.) case probably deserves a separate "walker".

If you end up having spare cycles and want to sync on a possible design for
some part of that bigger task -- removing the old pagewalk logic -- please do
reach out! :)
On Fri, Oct 25, 2024 at 08:13:26PM +0200, Vlastimil Babka wrote:
> On 10/23/24 18:24, Lorenzo Stoakes wrote:
> > The existing generic pagewalk logic permits the walking of page tables,
> > invoking callbacks at individual page table levels via user-provided
> > mm_walk_ops callbacks.
> >
> > This is useful for traversing existing page table entries, but precludes
> > the ability to establish new ones.
> >
> > Existing mechanisms for performing a walk which also installs page table
> > entries if necessary are heavily duplicated throughout the kernel, each
> > with semantic differences from one another and largely unavailable for use
> > elsewhere.
> >
> > Rather than add yet another implementation, we extend the generic pagewalk
> > logic to enable the installation of page table entries by adding a new
> > install_pte() callback in mm_walk_ops. If this is specified, then upon
> > encountering a missing page table entry, we allocate and install a new one
> > and continue the traversal.
> >
> > If a THP huge page is encountered at either the PMD or PUD level we split
> > it only if there are ops->pte_entry() (or ops->pmd_entry at PUD level),
> > otherwise if there is only an ops->install_pte(), we avoid the unnecessary
> > split.
> >
> > We do not support hugetlb at this stage.
> >
> > If this function returns an error, or an allocation fails during the
> > operation, we abort the operation altogether. It is up to the caller to
> > deal appropriately with partially populated page table ranges.
> >
> > If install_pte() is defined, the semantics of pte_entry() change - this
> > callback is then only invoked if the entry already exists. This is a useful
> > property, as it allows a caller to handle existing PTEs while installing
> > new ones where necessary in the specified range.
> >
> > If install_pte() is not defined, then there is no functional difference to
> > this patch, so all existing logic will work precisely as it did before.
> >
> > As we only permit the installation of PTEs where a mapping does not already
> > exist there is no need for TLB management, however we do invoke
> > update_mmu_cache() for architectures which require manual maintenance of
> > mappings for other CPUs.
> >
> > We explicitly do not allow the existing page walk API to expose this
> > feature as it is dangerous and intended for internal mm use only. Therefore
> > we provide a new walk_page_range_mm() function exposed only to
> > mm/internal.h.
> >
> > Reviewed-by: Jann Horn <jannh@google.com>
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Thanks!

> Just a small subjective suggestion in case you agree and there's a respin or
> followups:
>
> > @@ -109,18 +131,19 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> >
> >  		if (walk->action == ACTION_AGAIN)
> >  			goto again;
> > -
> > -		/*
> > -		 * Check this here so we only break down trans_huge
> > -		 * pages when we _need_ to
> > -		 */
> > -		if ((!walk->vma && (pmd_leaf(*pmd) || !pmd_present(*pmd))) ||
> > -		    walk->action == ACTION_CONTINUE ||
> > -		    !(ops->pte_entry))
> > +		if (walk->action == ACTION_CONTINUE)
> >  			continue;
> > +		if (!ops->install_pte && !ops->pte_entry)
> > +			continue; /* Nothing to do. */
> > +		if (!ops->pte_entry && ops->install_pte &&
> > +		    pmd_present(*pmd) &&
> > +		    (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
> > +			continue; /* Avoid unnecessary split. */
>
> Much better now, thanks, but maybe the last 2 parts could be:
>
> 	if (!ops->pte_entry) {
> 		if (!ops->install_pte)
> 			continue; /* Nothing to do. */
> 		else if (pmd_present(*pmd) &&
> 			 (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
> 			continue; /* Avoid unnecessary split. */
> 	}

I quite liked separating out the 'nothing to do' vs. the 'unnecessary split'
cases, but I agree it makes it harder to see that the 2nd case is an 'install
pte ONLY' case.

Yeah, so I think your version is better, but maybe we can find a way to be
more expressive somehow... if we could declare vars mid-way through it'd be
easier :P

Will improve on respin if it comes up.

> Or at least put !ops->pte_entry first in both conditions?

Ack, yeah, that'd be better!

> >  		if (walk->vma)
> >  			split_huge_pmd(walk->vma, pmd, addr);
> > +		else if (pmd_leaf(*pmd) || !pmd_present(*pmd))
> > +			continue; /* Nothing to do. */
> >
> >  		err = walk_pte_range(pmd, addr, next, walk);
> >  		if (err)
> > @@ -148,11 +171,14 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
> >   again:
> >  		next = pud_addr_end(addr, end);
> >  		if (pud_none(*pud)) {
> > -			if (ops->pte_hole)
> > +			if (ops->install_pte)
> > +				err = __pmd_alloc(walk->mm, pud, addr);
> > +			else if (ops->pte_hole)
> >  				err = ops->pte_hole(addr, next, depth, walk);
> >  			if (err)
> >  				break;
> > -			continue;
> > +			if (!ops->install_pte)
> > +				continue;
> >  		}
> >
> >  		walk->action = ACTION_SUBTREE;
> > @@ -164,14 +190,20 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
> >
> >  		if (walk->action == ACTION_AGAIN)
> >  			goto again;
> > -
> > -		if ((!walk->vma && (pud_leaf(*pud) || !pud_present(*pud))) ||
> > -		    walk->action == ACTION_CONTINUE ||
> > -		    !(ops->pmd_entry || ops->pte_entry))
> > +		if (walk->action == ACTION_CONTINUE)
> >  			continue;
> > +		if (!ops->install_pte && !ops->pte_entry && !ops->pmd_entry)
> > +			continue; /* Nothing to do. */
> > +		if (!ops->pmd_entry && !ops->pte_entry && ops->install_pte &&
> > +		    pud_present(*pud) &&
> > +		    (pud_trans_huge(*pud) || pud_devmap(*pud)))
> > +			continue; /* Avoid unnecessary split. */
>
> Ditto.

Ack!

> Thanks!

Cheers!